PhD. Manuel Alfredo PECH PALACIO. On December 12 th, 2005

Transcription

1 2005 ISAL PhD Jointly awarded at the Institut National des Sciences Appliquées de Lyon (Ecole Doctorale Informatique et Information pour la Société) and the Universidad de las Américas, Puebla (School of Engineering, Department of Computer Sciences) By Manuel Alfredo PECH PALACIO On December 12 th, 2005 "Spatial Data Modeling and Mining using a Graph-based Representation" PhD Committee Chair, Pr. Eduardo Morales Manzanares, ITESM, Morelos Tutors: Pr. Robert Laurini, INSA of Lyon Dr. Anne Tchounikine, INSA of Lyon Dr. David Sol Martínez, UDLAP Dr. Jesús A. González Bernal, INAOE Reviewers: Pr. Hervé Martin, U. Joseph Fourier, Grenoble Dr Nicandro Cruz Ramírez, U. Veracruzana Examiner: Dr. François Fages, INRIA

2 ABSTRACT Motivation Several approaches have been developed for mining spatial data (i.e., generalization-based methods, clustering, spatial associations, approximation and aggregation, mining in image and raster databases, spatial classification and spatial trend detection). However, we argue that these approaches do not consider all the elements found in a spatial database (spatial data, non-spatial data and spatial relations among the spatial objects) in an extended way. Some of them focus first on spatial data and then on non-spatial data or vice versa, and others consider restricted combinations of these elements. We think that it is possible to enhance the generated results of the data mining task by mining them as a whole and not as separate elements (they are related elements). A graph representation provides the flexibility to describe these elements together and this is the motivation to explore the area of graph-based spatial knowledge discovery. I

3 Proposal Our idea is to create a unique graph-based model to represent spatial data, non-spatial data and the spatial relations among spatial objects. We will generate datasets composed of graphs with a set of these three elements. We consider that by mining a dataset with these characteristics a graph-based mining tool can search patterns involving all these elements at the same time improving the results of the spatial analysis task. A significant characteristic of spatial data is that the attributes of the neighbors of an object may have an influence on the object itself. So, we propose to include in the model three relationship types (topological, orientation, and distance relations). In the model the spatial data (i.e., spatial objects), non-spatial data (i.e., non-spatial attributes), and spatial relations are represented as a collection of one or more directed graphs. A directed graph contains a collection of vertices and edges representing all these elements. Vertices represent either spatial objects, spatial relation types between two spatial objects (binary relation), or non-spatial attributes describing the spatial objects. Edges represent a link between two vertices of any type. According to the type of vertices that an edge joins, it can represent either an attribute name or a spatial relation name. The attribute name can refer to a spatial object or a non-spatial entity. We use directed edges to represent directional information of relations among elements (i.e., object x covers object y) and to describe attributes about objects (i.e., object x has attribute z). We propose to adopt the Subdue system, a general graph-based data mining system developed at the University of Texas at Arlington, as our mining tool. Subdue discovers II

4 substructures using a graph-based representation of structural databases. The substructures (a connected subgraph within the graphical representation) describe structural concepts in the data (i.e., patterns). The discovery algorithm follows a computationally constrained beam search. The algorithm begins with the substructure matching a single vertex in the graph. On each iteration, the algorithm selects the best substructure and incrementally expands the instances of the substructure. An instance of a substructure in the input graph is a subgraph that matches (graph theoretically) that substructure. A special feature named overlap has a primary role in the substructures discovery process and consequently a direct impact over the generated results. However, it is currently implemented in an orthodox way: all or nothing. If we set overlap to true, Subdue will allow the overlap among all instances sharing at least one vertex. On the other hand, if overlap is set to false, Subdue will not allow the overlap among instances sharing at least one vertex. So, we propose a third approach: limited overlap, which gives the user the capability to set over which vertices the overlap will be allowed (vertices representing remarkable elements that refer, for instance, to a spatial object in a spatial database or to some characteristic defining a particular topic of a dataset). We visualize directly three motivations issues to propose the implementation of the new algorithm: search space reduction, processing time reduction, and pattern oriented search. III

5 Contribution The contribution to the discovery knowledge in the spatial data domain, described in this dissertation, is the development of a new approach for spatial data modeling and mining using a graph-based representation. This contribution includes the following results: A new graph-based data representation for spatial, non-spatial data and spatial relations. A new algorithm to discover substructures using a limited overlap approach in the Subdue system. A prototype system implementing the proposed model. IV

6 Acknowledgments Thank you; you are my guide, motivation and inspiration: Erandi, my darling wife Paula Ireri, my wonderful daughter Manuel André, my little son Manuel and Elsy, my parents I want to express my gratitude, respect and admiration my tutors for the time, advices and support they granted me during the research: Mexican tutors: Dr. David Sol Martínez Dr. Jesús A. González Bernal Gracias French tutors: Pr. Robert Laurini Dr. Anne Tchounikine Merci V

7 This works was supported in part by the Excellent Graduate Scholarship from the Fundación Universidad de las Américas, Puebla, H project (Habitar y vivir. Análisis del espacio habitacional de la ciudad de Puebla ) Scholarship from the Mexican Council of Science and Technology, SIRPO project (Sistema de Información para los Riesgos del Popocatépetl) research grand from the Laboratoire Franco-Mexicain d'informatique, the Institut National des Sciences Appliquées de Lyon, and Excellent Scholarship from the Gobierno del Estado de Quintana Roo. VI

8 Table of contents ABSTRACT... I Motivation... I Proposal... II Contribution... IV Appendix A RESUMEN EN ESPAÑOL...i A.1. Introducción...i A.1.1 Métodos basados en generalización... ii A.1.2 Agrupamiento... iii A.1.3 Asociaciones espaciales...iv A.1.4 Aproximación y agregación...iv A.1.5 Minería de datos en imágenes...v A.1.6 Clasificación de datos espaciales...v A.1.7 Detección de tendencias espaciales...vi A.2. Motivación...vi A.2.1 Relaciones espaciales... vii A.3 Representaciones basadas en grafos... viii VII

9 A.4 Minando el grafo...xix A.5 Resultados... xxii A.6 Conclusiones...xxx A.7 Contribución... xxxiii Appendix B RÉSUMÉ EN FRANÇAIS...xxxv B.1. Introduction...xxxv B.1.1 Méthodes basées sur la généralisation...xxxvi B.1.2 Regroupement... xxxvii B.1.3 Associations spatiales... xxxviii B.1.4 Rapprochement et agrégation... xxxviii B.1.5 Fouille de données-images...xxxix B.1.6 Classification de données spatiales...xxxix B.1.7 Détection de tendances spatiales...xl B.2. Motivation...xl B.2.1 Relations spatiales... xli B.3 Représentations basées sur des graphes... xlii B.4 Fouille du graphe... liii B.5 Résultats... lvi B.6 Conclusions... lxiv B.7 Contribution... lxvii Chapter 1 INTRODUCTION Motivation Proposal Contribution...6 VIII

10 1.4 Organization of the thesis...7 Chapter 2 RELATED WORK Geographic Information System (GIS) Data Mining Spatial Data Mining Generalization-based Method Clustering Spatial Associations Approximation and Aggregation Mining an Image Database Classification Learning Spatial Trend Detection Spatial relations Neighborhood Graphs, Neighborhood Paths and Neighborhood Indices Topological Relations Distance Relations Direction Relations Conclusion...30 Chapter 3 GRAPH-BASED REPRESENTATIONS Generalities Methodology Spatial Graph-based Data Representations Use-case Conclusion...58 IX

11 Chapter 4 MINING THE GRAPH Characteristics Main Functions Overlap Limited Overlap Conclusion...94 Chapter 5 PROTOTYPE Population Census from the year of 1777 in Puebla downtown Modules Conclusion Chapter 6 RESULTS Population census from the year of 1777 in Puebla downtown Use-case: El Sagrario Use-case: People living along the borders of the river crossing Puebla downtown Popocatépetl volcano Use-case: Popocatépetl Model #1 - base model Model #2 - single replication of relation types, complete information Model #3 - double replication of relation types, no complete information Model #4 - single replication of relation types, no complete information Model #5 - double replication of relation types, complete information Conclusion Chapter 7 CONCLUSIONS X

12 BIBLIOGRAPHY Appendix C AGREEMENTS UDLAP/INSA DE LYON... lxix C.1 CONVENTION DE CO-TUTELLE DE THÈSE...lxx C.2 AVENANT RELATIF A LA CONVENTION DE THÈSE EN CO-TUTELLE DE MANUEL ALFREDO PECH PALACIO... lxxiii XI

13 List of figures Figura A.1. Modelo basado en grafos para representar datos espaciales....ix Figura A.2. Base de datos de ejemplo para caracterizar los 3 modelos propuestos....xi Figura A.3. Modelo #1 - modelo base.... xii Figura A.4. Modelo #2 - replicación simple de tipos de relación, información completa. xiii Figura A.5. Modelo #3 - doble replicación de tipos de relación, información no completa.xv Figura A.6. Relaciones entre carreteras y ríos usando el modelo #1...xxiv Figura A.7. Relaciones entre carreteras y poblaciones usando el modelo #1...xxv Figura A.8. Relaciones entre ríos y poblaciones usando el modelo #1.... xxvii Figure B.1. Modèle basé sur des graphes pour représenter des données spatiales... xliii Figure B.2. Base d'exemple pour caractériser les 3 modèles proposés...xlv Figure B.3. Modèle n 1 - modèle base.... xlvi Figure B.4. Modèle n 2 - réplication simple des types de relation, information complète.... xlvii Figure B.5. Modèle n 3 - double réplication des types de relation, information non complète... xlix Figure B.6. Relations entre des routes et des rivières en utilisant le modèle n 1... lviii XII

14 Figure B.7. Relations entre des routes et des villes en utilisant le modèle n lix Figure B.8. Relations entre des rivières et des villes en utilisant le modèle n lxi Figure 2.1. Geographic Information System Figure 2.2. Knowledge Discovery in Databases...13 Figure 2.3. Data mining: integration of several fields Figure 2.4. Architecture for a KDD system...15 Figure 2.5. Example of the interior, boundary and exterior of a circle Figure 2.6. Nine intersection model...27 Figure 2.7. Topological Relations...28 Figure 2.8. Distance Relations...28 Figure 2.9. Direction Relations...30 Figure 3.1. General graph-based model to represent spatial data...35 Figure 3.2. Sample dataset...40 Figure 3.3. Model #1 - base model Figure 3.4. Model #2 - single replication of relation types, complete information...43 Figure 3.5. Model #3 - double replication of relation types, no complete information...45 Figure 3.6. Model #4 - single replication of relation types, no complete information...47 Figure 3.7. Model #5 - double replication of relation types, complete information...48 Figure 3.8. Spatial database representing some objects of the world...53 Figure 3.9. Selection window Figure Querying a spatial database...55 Figure Graph-based representation for spatial data Figure 4.1. Graph representation of the house domain...61 Figure 4.2. Substructure and instances discovered from the house domain by Subdue XIII

15 Figure 4.3. Substructure replacement procedure in the house domain...62 Figure 4.4. Graph representation of the house domain after substructure replacement Figure 4.5. SGISO - input graph_ Figure 4.6. SGISO - input graph_ Figure 4.7. SGISO - no overlap Figure 4.8. SGISO - no overlap, 1 instance in graph_ Figure 4.9. SGISO - overlap Figure SGISO - overlap, 4 instances in graph_ Figure MDL - input graph_ Figure MDL example 1 - input graph_ Figure MDL example 1 - no overlap Figure MDL example 2 - input graph_ Figure MDL example 2 - overlap Figure MDL example 3 - input graph_ Figure MDL example 3 - overlap Figure MDL example 4 - input graph_ Figure MDL example 4 - overlap Figure MDL example 5 - input graph_ Figure MDL example 5 - overlap Figure MDL example 6 - input graph_ Figure MDL example 6 - overlap Figure MDL example 7 - input graph_ Figure MDL example 7 - overlap Figure MDL example 8 - input graph_ XIV

16 Figure MDL example 8 - overlap Figure MDL example 9 - input graph_ Figure MDL example 9 - overlap Figure Limited overlap - input graphs PS_1 and PS_ Figure Limited overlap - input graph_ Figure No overlap - SGISO Figure Overlap - SGISO Figure Limited overlap PS_1 - SGISO...84 Figure Limited overlap in the Subdue system Figure No overlap - compressed graph...88 Figure No overlap - discovered substructures Figure Overlap - compressed Graph...91 Figure Overlap - discovered substructures Figure Limited overlap PS_1 - compressed graph...93 Figure Limited overlap - discovered substructures...94 Figure 5.1. Representation of spatial concepts in the census from the year of Figure 5.2. First representation for the non-spatial data in the population census from the year of Figure 5.3. Second representation for the non-spatial data in the population census from the year of Figure 5.4. The query panel Figure 5.5. The map panel Figure 5.6. The spatial graph panel Figure 5.7. Graph representation of processed data XV

17 Figure 5.8. The non-spatial graph panel Figure 5.9. The Subdue panel Figure Example of Subdue s standard output Figure Layout for reading the Subdue s discovered substructures Figure 6.1. Population census from the year of 1777 in Puebla downtown Figure 6.2. Parishes in Puebla downtown in the 1777 year Figure 6.3. Blocks 150m. from representative church, parish El Sagrario Figure 6.4 Example of a generated graph in the use-case El Sagrario Figure 6.5. Examples of discovered patterns by using standard overlap in use-case El Sagrario (1) Figure 6.6. Examples of discovered patterns by using limited overlap in use-case El Sagrario (1) Figure 6.7. Processing time standard vs. limited overlap: use-case El Sagrario (1) Figure 6.8. Blocks 150m. North side from representative church, parish El Sagrario Figure 6.9. Examples of discovered patterns in the use-case El Sagrario (2) Figure Processing time standard vs. limited overlap: use-case El Sagrario (2) Figure Blocks 50m. from river crossing Puebla downtown Figure Examples of discovered patterns in use-case people around the river crossing Puebla downtown Figure Processing time standard vs. limited overlap: use-case people around the river crossing Puebla downtown Figure Popocatépetl volcano Figure Popocatépetl volcano: study zone Figure Relationships among roads and rivers by using model # XVI

18 Figure Relationships among roads and settlements by using model # Figure Relationships among rivers and settlements by using model # Figure Relationships among roads and rivers by using model # Figure Relationships among roads and settlements by using model # Figure Relationships among rivers and settlements by using model # Figure Relationships among roads and rivers by using model # Figure Relationships among roads and settlements by using model # Figure Relationships among rivers and settlements by using model # Figure Relationships among roads and rivers by using model # Figure Relationships among roads and settlements by using model # Figure Relationships among rivers and settlements by using model # Figure Relationships among roads and rivers by using model # Figure Relationships among roads and settlements by using model # Figure Relationships among rivers and settlements by using model # XVII

19 List of tables Tabla A.1. Características de los modelos de representación basados en grafos....xvi Tabla A.2. Instancias/iteraciones por cada modelo basado en grafos: caso de uso Popocatépetl... xxviii Tabla A.3. Max/Min de instancias descubiertas por objeto-objeto /característica overlap....xxix Tabla A.4. Promedio de instancias descubiertas por modelo/ objeto-objeto....xxix Tabla A.5. Promedio de instancias descubiertas por modelo/característica overlap...xxx Tabla A.6. Promedio de instancias descubiertas por modelo....xxx Table B.1. Caractéristiques des modèles de représentation basés sur des graphes....l Table B.2. Instances/itérations par chaque modèle basé sur des graphes : cas d'utilisation Popocatépetl... lxii Table B.3. Max/Min d'instances découvertes par "objet-objet"/caractéristique recouvrement.... lxiii Table B.4. Moyenne d'instances découvertes par modèle/"objet-objet".... lxiii Table B.5. Moyenne d'instances découvertes par modèle/caractéristique recouvrement.. lxiv Table B.6. Moyenne d'instances découvertes par modèle.... lxiv XVIII

20 Table 3.1. Characteristics of the graph-based representation models...49 Table 6.1. Instances/iterations by each graph-based model: Popocatépetl use-case Table 6.2. The best model to discover complete patterns among the road and river spatial objects (no overlap) Table 6.3. Max/Min of discovered instances by object-object /overlap feature Table 6.4. Average of discovered instances by model/ object-object Table 6.5. Average of discovered instances by model/overlap feature Table 6.6. Average of discovered instances by model XIX

21 Appendix A RESUMEN EN ESPAÑOL A.1. Introducción En los últimos años hemos sido testigos del rápido crecimiento en el número, capacidad y diseminación de aplicaciones informáticas dedicadas a la obtención, generación, manipulación y almacenamiento de datos en diversos ámbitos de la vida humana. Esto ha propiciado una gran cantidad de colecciones de datos cuyo análisis por medios manuales se vuelve una tarea complicada. Recordemos que en muchas ocasiones los datos crudos necesitan ser analizados e interpretados para convertirlos en información útil y provechosa. Tal situación ha propiciado una creciente necesidad por técnicas/herramientas computacionales que nos ayuden en estas tareas. Descubrimiento de conocimiento en bases de datos (KDD, por sus siglas en el idioma Inglés) es definido como la extracción no trivial de información implícita, previamente desconocida y potencialmente útil a partir de datos [16]. Este es un proceso iterativo e interactivo que envuelve diferentes fases. El núcleo del proceso es la fase de minado de datos, que se conceptualiza como la aplicación de algoritmos de análisis de datos y de descubrimiento que bajo parámetros aceptables de i

22 eficiencia computacional producen/descubren una enumeración particular de patrones sobre los datos mismos [12]. En este mismo contexto, pero enfocado al análisis y explotación de datos provenientes de fenómenos generados en, sobre, y bajo la superficie de la tierra, llamados datos espaciales, ha generado un nuevo dominio de investigación llamado Minería de datos Espaciales. De tal forma, la Minería de datos espaciales se enfoca al descubrimiento de conocimiento implícito, y previamente desconocido en datos espaciales [16]. Como resultado de esta necesidad creciente diversos enfoques para el minado de datos espaciales han sido desarrollados, entre los más representativos encontramos: A.1.1 Métodos basados en generalización La generalización ha demostrado ser uno de los métodos efectivos para descubrir conocimiento. Fue introducido por la comunidad de aprendizaje máquina y se basa en el aprendizaje a partir de ejemplos. El descubrimiento de conocimiento basado en generalización requiere jerarquías de conceptos (dadas explícitamente por el experto ó generadas automáticamente). En la caso de las bases de datos espaciales, pueden darse dos tipos de jerarquías de conceptos: (1) Jerarquías temáticas, por ejemplo, generalizar tomates y plátanos como frutas, las frutas y vegetales como alimentos de origen vegetal. (2) Jerarquías espaciales, por ejemplo, generalizar una serie de puntos espaciales como una región ó país. ii

23 Lu et al. [35] extienden la técnica attribute-oriented induction a las bases de datos espaciales. Esta técnica se basa en escalar la jerarquía de generalización e ir resumiendo las relaciones entre los datos espaciales y no espaciales a un nivel de concepto más alto. Los autores presentan dos algoritmos basados en generalización: (1) Enfoque de dominación de datos no espaciales. Este método realiza en primera instancia inducción orientada al atributo sobre los datos no-espaciales y posteriormente sobre los espaciales. (2) Enfoque de dominación de datos espaciales. Dado la jerarquía de datos espaciales, la generalización se realizado primero sobre estos datos y posteriormente sobre los datos no espaciales. A.1.2 Agrupamiento Agrupamiento (clustering) es el proceso de agrupar de manera física ó abstracta objetos en clases de objetos similares. Este enfoque de minería de datos nos ayuda a construir particiones representativas de un conjunto de objetos dada una medida de similitud/distancia (Ej. distancia euclidiana). Esto es, el agrupamiento de datos identifica grupos (clusters) ó regiones densamente pobladas de acuerdo a alguna medida de distancia en un conjunto de datos multidimensionales. Podemos clasificar a los algoritmos de agrupamiento en cuatro grupos principales: Algoritmos de particionamiento basados en los enfoques k-means (centro de gravedad del cluster) y k-medoid (objeto representativo del cluster), algoritmos jerárquicos, algoritmos basados en la ubicación de los objetos (agrupamiento por densidad), y por último los basados en grids. iii

24 A.1.3 Asociaciones espaciales Una regla de asociación espacial es una regla que describe la implicación de uno o un conjunto de objetos por otro conjunto de objetos en base de datos espaciales [29]. Un ejemplo de una regla de asociación espacial podría ser si la empresa se ubica cerca de la Ciudad de México entonces es una empresa grande. Una regla de asociación espacial es de la forma X Y, donde X y Y son conjuntos de predicados espaciales o no espaciales. Existen varios tipos de predicados espaciales que pudieran constituir una regla de asociación espacial, por ejemplo, relaciones topológicas como son intersección, traslape y orientaciones espaciales tales como Izquierda_de, y Oeste_de. A.1.4 Aproximación y agregación Los métodos basados en aproximación y agregación buscan analizar las características de grupos de objetos (clusters) en base a objetos (features) cercanos a ellos. Proximidad agregada es la medida de cercanía de un conjunto de puntos en un cluster a un feature. La idea de encontrar relaciones de proximidad no es un problema simple como podría parecer, existen tres razones para esta aseveración. Supongamos que tenemos un cluster de puntos y queremos encontrar los k-features más cercanos a él: El tamaño y forma del cluster y los features puede ser muy variado. Podríamos tener una gran cantidad de features para examinar. Aún en el caso de encontrar una forma conocida (Ej. polígono) que describa la forma del cluster, sería inadecuado reportar los features cuyos límites estén más iv

25 cerca a los límites de éste, porque la distribución de los puntos al interior del cluster puede no ser uniforme. A.1.5 Minería de datos en imágenes Extracción de patrones a partir de imágenes es otro enfoque de minería de datos espaciales. En la literatura existen diversos trabajos de minado de datos desarrollados bajo este enfoque. Por ejemplo, Fayyad et al. [14] presentan un sistema para la identificación y categorización de volcanes en la superficie de Venus a partir de imágenes transmitidas por la sonda espacial Magellan. En otro trabajo [15] (Second Palomar Observatory Sky Survey) se usaron árboles de decisión para la clasificación de galaxias, estrellas y otros objetos estelares. Stolorz et al. [44] y Shek et al. [41] efectuaron estudios sobre minería de datos espacio-temporal en conjuntos de datos geofísicos. A.1.6 Clasificación de datos espaciales La clasificación de datos espaciales tiene como objetivo encontrar reglas que dividan un conjunto de objetos en un número de grupos, donde los objetos de cada grupo pertenecen a una clase. Diversos tipos de información pueden ser usados para caracterizar los objetos espaciales. Por ejemplo, atributos no espaciales de un objeto, predicados espaciales y funciones espaciales. La idea es usar esta información para extraer ya sea atributos para la etiquetación de clases (atributos que dividen los datos en clases) y atributos predictivos (atributos cuyos valores son usados en un árbol de decisión para crear sus ramas). v

26 A.1.7 Detección de tendencias espaciales Detección de tendencias espaciales describe los cambios regulares de uno ó más atributos no espaciales de un objeto cuando éste se desplaza desde un punto de referencia inicial. Un ejemplo de tendencia espacial sería alejándose del centro histórico de la ciudad de Puebla es precio de los terrenos decrece. Las trayectorias de movimiento a partir de un punto x son usadas para modelar dicho movimiento y análisis de regresión sobre los atributos de los objetos son usados para describir patrones de cambio. Existen dos tipos de tendencias: globales y locales. A.2. Motivación Aunque los enfoques antes mencionados realizan la tarea de minado de datos espaciales de manera exitosa, nuestra percepción es que éstos no consideran todos los elementos encontrados en una base de datos espaciales (datos espaciales, datos no espaciales y relaciones espaciales entre los objetos espaciales) de una manera integral. Es decir, algunos de ellos primero realizan minado de datos espaciales y posteriormente minado de datos no espaciales ó viceversa, y otros permiten combinaciones de estos elementos pero de manera restringida. Con base en lo anterior, proponemos el argumento siguiente: si somos capaces de representar los datos espaciales, no espaciales y las relaciones entre objetos espaciales como un solo conjunto de datos, y lo minamos como tal, podríamos generar/encontrar patrones de conocimiento que describan/caractericen nuestro conjunto de datos conteniendo estos tres elementos de manera conjunta. Para tal efecto en este trabajo se argumenta que vi

27 una representación basada en grafos es lo suficientemente flexible y poderosa para representar estos elementos de manera conjunta, fácilmente entendible y capaz de crear diferentes representaciones del mismo dominio. El dominio es descrito usando grafos, los grafos se convierten en datos de entrada para una herramienta de descubrimiento de conocimiento basada en grafos la cual usa heurísticas para seleccionar subgrafos que son considerados importantes (patrones). Un grafo es definido como un par G = (V,E). V = {v 1,,v n } denota un conjunto finito de elementos llamados vértices. E es un conjunto de arcos e satisfaciendo E [V] 2. Entonces, cada arco e E es un par (v i,v j ). Si (v i,v j ) es un par ordenado para cualesquiera (v i,v j ) E, entonces se dice que G = (V,E) es un grafo dirigido. Un grafo etiquetado tiene etiquetas asociadas a sus vértices y arcos. Como comentamos anteriormente en nuestro modelo proponemos la representación de relaciones espaciales entre los objetos espaciales. En la siguiente subsección detallamos los tres tipos de relaciones espaciales que proponemos incluir. A.2.1 Relaciones espaciales La ubicación explícita de los objetos espaciales define relaciones implícitas de vecindad (neighborhood) espacial entre ellos. De tal forma, la información sobre la vecindad de los objetos espaciales constituye un elemento valioso que debe ser considerado en la tarea de minado de datos espaciales. Martin Ester et al. [9][11] introducen el concepto de grafos de vii

28 vecindad para representar explícitamente estas relaciones de vecindad implícitas. Los grafos de vecindad pueden cubrir las relaciones de vecindad siguientes: Topológicas. Derivadas del modelo de nueve intersecciones [6][7][8], son relaciones que permanecen invariantes bajo transformaciones lineales, es decir, si ambos objetos se rotan, se trasladan ó se escalan simultáneamente las relaciones entre ellos se preservan. Distancia. Compara la distancia entre dos objetos dada una constante usando operadores aritméticos tales como <, >, =. La distancia entre dos objetos se definen como la distancia mínima entre ellos (Ej. seleccionar todos los elementos dentro de una radio de 50 kilómetros desde un punto x). Dirección. La relación espacial de dirección entre 2 objetos espaciales A y B (B R A) se define usando un punto representativo del objeto A y todos los puntos del objeto destino B. El punto representativo del objeto fuente A puede ser el centro del objeto ó un punto sobre sus límites. Este punto representativo es usado como el origen de un sistema de coordenadas virtuales y su cuadrante define la dirección. Una vez comentada la motivación de nuestro trabajo de investigación se presenta a continuación el modelo general basado en grafos para representar los datos espaciales. A.3 Representaciones basadas en grafos Como hemos comentado, nuestra propuesta se basa en crear un modelo basado en grafos para representar conjuntamente datos espaciales, no espaciales y relaciones espaciales entre viii

29 los objetos espaciales. La idea es usar los grafos generados como datos de entrada para un algoritmo de minado de datos, de tal forma, que el algoritmo pueda encontrar patrones involucrando estos elementos de manera conjunta y no como elementos separados. En consecuencia, proponemos el modelo de representación mostrado (en notación UML) en la Figura A.1. Topological Distance Direction 1..* Spatial relation -From: -To: Edge Vertex Spatial object 1..* Non-spatial attribute Figura A.1. Modelo basado en grafos para representar datos espaciales. En el modelo, los datos espaciales (Ej. objetos espaciales), datos no espaciales (Ej. atributos descriptivos), y relaciones espaciales son representados como una colección de uno ó más grafos dirigidos etiquetados. Los vértices pueden representar objetos espaciales, tipos de relación espacial entre dos objetos (relación binaria), ó atributos no espaciales describiendo los objetos espaciales. Los arcos representan una liga existente entre dos vértices de cualquier tipo. Dependiendo del tipo de vértices que un arco une, éste puede representar el nombre de un atributo descriptivo ó el nombre de una relación espacial. Los nombres de atributos pueden referirse a descripciones de objetos espaciales y/ó a entidades no espaciales. Se usan arcos dirigidos para representar la información direccional de relaciones ix

30 entre elementos (Ej. objeto x cubre objeto y) y para describir atributos pertenecientes a objetos (Ej. objeto x tiene atributo z). Actualmente se han desarrollado cinco representaciones a partir del modelo general previamente descrito. Tres tópicos definen las características de los grafos creados en cada modelo: (1) Representación de las relaciones espaciales equivalentes (Ej. tocar, traslapar). (2) Representación de las relaciones espaciales simétricas (Ej. contiene-dentro_de, Norte_de-Sur_de). (3) La manera de representar los objetos y sus relaciones en el modelo. En consecuencia, los grafos creados se diferencian de manera cuantitativa y cualitativa. En la parte cuantitativa tenemos diferencias tales como: número de vértices y arcos empleados para representar los datos espaciales, no espaciales y las relaciones, creación de grafos simples, manejo de arcos dirigidos y no dirigidos para representar relaciones espaciales y/o atributos descriptivos. En la parte cualitativa hemos observado, a través de experimentación, que ciertos modelos tienen una mayor expresividad para representar el conjunto de datos, condicionando de manera directa la calidad de los resultados generados en el proceso de minado. A continuación se presentan 3 de los 5 modelos propuestos describiendo las métricas de evaluación creadas para caracterizar a cada uno de ellos. Modelos Con la finalidad de describir las características de cada modelo, usaremos como conjunto de datos de ejemplo los mostrados en la Figura A.2. Como podemos observar, nuestro conjunto de datos se compone de dos objetos espaciales, objeto A representado una casa y objeto B representando un lago, y las tres relaciones espaciales siguientes: (1) Distancia, x

31 objeto A cerca de objeto B (relación equivalente). (2) Topológica, objeto A toca objeto B (relación equivalente). (3) Dirección, objeto A Sur de objeto B (relación simétrica). Object B North West East Object A South Figura A.2. Base de datos de ejemplo para caracterizar los 3 modelos propuestos. Modelo #1 - modelo base La Figura A.3 muestra el primer modelo creado para representar datos espaciales bajo el enfoque propuesto. Las características del modelo de acuerdo a las métricas creadas para su caracterización son: Num. vértices: 2 vértices, cada uno representando un objeto espacial (objeto A y objeto B). Num. arcos: 4 arcos, 3 arcos para representar las relaciones espaciales originales existentes en nuestro conjunto de datos de ejemplo ( cerca, toca y Sur_de ) y un arco para representar la relación Norte_de creada de la relación simétrica original Sur_de. La relación Norte_de es en si misma una relación simétrica. Tamaño (vértices + arcos): 6 % incremento: 0%, este es el modelo base. Grafo simple. No, Es un grafo complejo con 4 arcos uniendo 2 vértices. xi

32 Arco dirigido. Si, son usados para representar las relaciones simétricas Sur_de y Norte_de. La dirección de los arcos va en concordancia con la lectura de las relaciones entre los objetos. Arco no dirigido. Si, son usados para representar las relaciones equivalentes cerca y toca. Información completa. Si, en el grafo es representada la relación simétrica Norte_de creada a partir de la relación simétrica original Sur_de. Arco Relation redundante. No, en el modelo no se usan arcos Relation. South_of Spatial object A close touch Spatial object B North_of Figura A.3. Modelo #1 - modelo base. Modelo #2 - replicación simple de tipos de relación, información completa En la Figura A.4 presentamos el segundo modelo creado para representar datos espaciales. Las características del modelo de acuerdo a las métricas son: Num. vértices: 5 vértices, 2 vértices para representar los objetos espaciales y 3 vértices para representar los tipos de relaciones espaciales topológica, distancia y dirección. Por cada tipo de relación especial existente entre dos objetos, se añade un vértice etiquetado con el nombre del tipo de la relación espacial. En el ejemplo existen una relación topológica, una relación de distancia y una relación de dirección. xii

33 Num. arcos: 6 arcos, 3 arcos para representar las relaciones originales, 2 arcos para representar las relaciones equivalentes ( cerca y toca ) creadas a partir de las relaciones originales y un arco para representar la relación simétrica ( Norte_de ) creada también de las relaciones originales. Tamaño (vértices + arcos): 11 % incremento: % Grafo simple. Si, existe a lo más un arco entre cualesquiera 2 vértices dados. Arco dirigido. Si, son usados para representar todas las relaciones. La dirección de los arcos va de los vértices representando los objetos espaciales a los vértices representado los tipos de relaciones espaciales. Arco no dirigido. No, en el modelo no se usan arcos no dirigidos. Información completa. Si, representamos las relaciones simétricas creadas a partir de las relaciones originales. Arco Relation redundante. No, en el modelo no usamos arcos Relation. Distance close close Spatial object A touch Topological touch Spatial object B South_of North_of Direction Figura A.4. Modelo #2 - replicación simple de tipos de relación, información completa. xiii

34 Modelo #3 - doble replicación de tipos de relación, información no completa En la Figura A.5 se presenta el tercer modelo creado para representar datos espaciales usando un enfoque de grafos. Las características del modelo de acuerdo a las métricas son: Num. vértices: 8 vértices, 2 vértices para representar los objetos espaciales y 6 vértices para representar los tipos de relaciones espaciales ( distancia, topológica y dirección ). Por cada tipo de relación espacial entre dos objetos espaciales, se añaden 2 vértices etiquetados con el nombre del tipo de la relación espacial. Por ejemplo, en nuestros datos de prueba existen tres tipos de relaciones: 1 relación topológica, 1 relación distancia y 1 relación dirección, de tal forma, se añaden 6 vértices, 2 por cada tipo de relación espacial. Num. arcos: 9 arcos, 6 arcos Relation para unir los vértices representando los objetos espaciales con los vértices representando los tipos de relaciones espaciales (desde cada vértice representando un objeto espacial nacen 3 arcos ya que existen 3 tipos de relación), y 3 arcos para presentar las relaciones espaciales originales. Estos 3 arcos son usados para unir los vértices representando los tipos de relaciones espaciales. Tamaño (vértices + arcos): 17 % incremento: % Grafo simple. Si, existe a lo más un arco entre cualesquiera 2 vértices dados. Arco dirigido. Si, son usados para representar las relaciones simétricas y los arcos Relation. La dirección de los arcos Relation va desde los vértices representando los objetos espaciales a los vértices representando los tipos de relaciones espaciales. xiv

35 La dirección de los restantes arcos va en concordancia a la lectura de las relaciones espaciales entre los objetos. Arco no dirigido. Si, son usados para representar las relaciones equivalentes. Información completa. No, en el modelo no se representan las relaciones simétricas que son creadas a partir de las relaciones espaciales originales. Arco Relation redundante. Si, en el modelo se usan arcos Relation para representar explícitamente la existencia y tipo de una relación espacial entre 2 objetos espaciales. Adicionalmente, se emplean estos arcos para evitar la creación de grafos complejos. Spatial object A Spatial object B Relation Relation Relation Distance close Distance Relation Relation Relation Topological touch Topological Direction South_of Direction Figura A.5. Modelo #3 - doble replicación de tipos de relación, información no completa. La Tabla A.1 presenta los resultados de las nueve métricas desarrolladas para evaluar las características de cada modelo propuesto (actualmente 5 modelos). El modelo #1 es llamado el modelo base. xv

36 Modelo Num. Num. Tamaño % Grafo Arco Arco no Información Arco Vértices Arcos (v + e) Incremento Simple Dirigido Dirigido Completa Relation Redundante (1) (2) (3) (4) (5) (6) (7) (8) (9) # No Yes Yes Yes No # Yes Yes No Yes No # Yes Yes Yes No Yes # Yes Yes Yes No Yes # Yes Yes No Yes Yes Tabla A.1. Características de los modelos de representación basados en grafos. Las métricas fueron propuestas basaron en el causas/efectos que cada uno de éstas tiene tanto en el grafo creado como en el algoritmo de minado. Visualizamos cuatro tópicos significativos relacionados directamente a estas métricas: 1. Espacio de búsqueda El espacio de búsqueda en un algoritmo de minado de datos basado en grafos consiste de todos los subgrafos que pueden ser derivados de su grafo de entrada, de tal forma, el número de vértices (1) y arcos (2) del grafo creado (3) definen el tamaño del espacio de búsqueda para el sistema del descubrimiento. Por lo tanto, el objetivo debe ser minimizar el número de vértices y arcos usados para crear los grafos pero al mismo tiempo maximizar la representatividad de los mismos. Como podemos ver en la Tabla A.1, el modelo usando el número mínimo de vértices y arcos para representar el conjunto de datos de ejemplo es el modelo #1 (2 vértices y 4 arcos) mientras que el modelo #5 es el caso opuesto (8 vértices y 12 arcos). xvi

37 2. Tiempo de procesamiento El tamaño del espacio de búsqueda juega un papel relevante con respecto al tiempo de procesamiento usado para descubrir patrones. Si tenemos un espacio búsqueda grande se requeriría más tiempo para evaluar todos los subgrafos posibles. Por lo tanto, una comparativa de la métrica "porcentaje de incremento" (4) entre los modelos propuestos se presenta en la Tabla A.1. Recordemos que esta métrica compara el tamaño de un modelo dado con respecto al modelo #1 (modelo base). Por ejemplo, el modelo #5 tiene un incremento de tamaño del grafo de % respecto al modelo #1. Es decir, el algoritmo de minado requerirá la evaluación de % más vértices y/o arcos usando el modelo #5 en vez del modelo #1 para el mismo conjunto de datos. 3. Complejidad del grafo En el capítulo 4 de la disertación se describe el sistema Subdue, nuestra herramienta de minado de datos basada en grafos. Cómo ahí describimos, existe una mayor complejidad para el algoritmo de minado trabajar con grafos complejos en vez de grafos simples (Ej. a lo máximo un arco uniendo cualesquiera dos vértices dado y no ciclos). Por ejemplo, en el proceso de macheo de grafos, la fase de expansión (Subdue emplea un enfoque de "expansión" para descubrir patrones), y la etapa de compresión del grafo. Por lo tanto, el objetivo fue proponer modelos basados en grafos que nos permitieran crear grafos simples. Como podemos ver en la Tabla A.1, únicamente el modelo #1 no permite crear grafos simples. xvii

38 En consecuencia, como estrategia para romper la multiplicidad de arcos entre dos vértices (por ejemplo, vértices representando dos objetos espaciales con dos ó más relaciones espaciales entre ellos) se emplean los enfoques siguientes: Añadir un nuevo vértice etiquetado con el nombre del tipo de la relación (Ej. topológica, distancia y dirección) por cada relación espacial entre los objetos espaciales. Este enfoque es usado en el modelo #2. Agregar un nuevo vértice (modelo #4) o dos nuevos vértices (modelo #3 y modelo #5) etiquetado como el nombre de tipo de la relación espacial (Ej. topológica, distancia y dirección) por cada relación espacial entre los objetos espaciales y unir este nuevo vértice (modelo #4) o nuevos vértices (modelo #3 y modelo #5) con los vértices representando los objetos espaciales por medio de arcos etiquetados como "Relation". El enfoque empleado para unir los vértices difiere para cada modelo tal y como se ha comentado en la definición de cada uno de ellos. Esta nomenclatura se utiliza para representar el hecho de que existe una relación espacial entre los objetos espaciales. Estos arcos son conocidos como arcos Relation redundantes (9). 4. Representatividad de los datos Las métricas arcos dirigidos (6), arcos no dirigidos (7), y información completa (8) son usadas para maximizar la representatividad de los datos pero minimizando, tanto como sea posible, el tamaño del grafo y su complejidad. Los arcos dirigidos son usados para representar las relaciones espaciales simétricas (objeto A Norte_de objeto B, implica, B Sur_de A), los arcos Relation redundantes, los atributos no-espaciales que describen a los objetos espaciales. Arcos no dirigidos son usados para representar las relaciones espaciales xviii

39 equivalentes (la relación es representada por un no dirigido en vez de dos arcos dirigidos). Finalmente, información completa significa que las relaciones espaciales simétricas entre objetos espaciales también son representadas en el modelo. A.4 Minando el grafo La característica de overlap (traslape entre vértices pertenecientes a diferentes instancias de una subestructura) desempeña un rol importante en el sistema de descubrimiento de patrones (subestructuras) en nuestra herramienta de minado de datos basada en grafos, el sistema Subdue. En consecuencia, los resultados generados están condicionados, en un alto porcentaje, al funcionamiento de esta característica. Sin embargo, su implementación actual es ortodoxa: se permite el overlap entre todas las instancias de una subestructura (sin ninguna regla) ó no se permite el overlap entre ninguna instancia perteneciente a una subestructura. Es decir, todo ó nada. En este contexto, proponemos un nuevo enfoque llamado overlap limitado. Una de las ventajas principales de este nuevo enfoque es la capacidad que se le da al usuario para especificar el conjunto de vértices donde el overlap será permitido. Estos vértices podrían representar elementos significativos en el contexto de trabajo. Visualizamos directamente tres motivaciones para proponer el nuevo algoritmo, los cuales serán explicados en las subsecciones siguientes: 1. Reducción del espacio de búsqueda En sistemas de descubrimiento de conocimiento basados en grafos, el algoritmo de minado de datos usa grafos como su representación de conocimiento. El espacio de búsqueda de un xix

40 algoritmo de este tipo consiste de todos los subgrafos que puedan ser derivados del grafo de entrada. El proceso de descubrimiento de subestructuras en Subdue comienza con la creación de subestructuras de un solo vértice a partir del grafo de entrada (una subestructura por cada etiqueta de vértice que exista al menos 2 veces en el grafo). En cada iteración del proceso de descubrimiento, el algoritmo selecciona las mejores subestructuras y expande las instancias de esas subestructuras añadiendo un arco vecino (ó un arco y un nuevo vértice) en todas las direcciones posibles. Pero como parte del proceso para seleccionar las mejores subestructuras y entonces expandirlas existe un proceso de filtrado. En este proceso, de acuerdo al valor del parámetro overlap las instancias de una subestructura son evaluadas: si el overlap es permito entonces las instancias compartiendo vértices son mantenidas, de lo contrario si el overlap no es permito las instancias que comparten vértices son descartadas. La mejor subestructura descubierta por Subdue (de acuerdo a sus métricas de evaluación), en cada iteración, puede ser empleada para comprimir el grafo de entrada el cual entonces puede convertirse en el nuevo grafo de entrada para una siguiente iteración. Después de varias iteraciones, Subdue crea una descripción jerárquica de los datos, donde subestructuras descubiertas en cierta iteración pueden estar definidas en base a subestructuras descubiertas en iteraciones previas. De tal forma, el número de instancias de una subestructura define el espacio de búsqueda (en cada iteración) en el proceso de descubrimiento de subestructuras. Como podemos observar, a través de uso del overlap limitado, se obtiene una reducción del espacio de búsqueda dado que el número de xx

41 instancias candidatas a ser expandidas se condiciona a los valores (vértices) donde el overlap es permitido, siendo éstos establecidos por el usuario. 2. Reducción del tiempo de procesamiento La reducción en el número de instancias candidatas a ser expandidas trae consigo una reducción del espacio de búsqueda, y esto beneficia en una reducción del tiempo de procesamiento para la búsqueda de subestructuras (patrones). Permitiendo el overlap en Subdue, hace que éste sea considerablemente más lento (en cuanto a tiempo de procesamiento) dado que el número de instancias candidatas a ser expandidas, evaluadas, comparadas, y descubiertas se incrementa como hemos descrito. Sin embargo, con la implementación del overlap limitado, el número de instancias a ser procesadas en estas fases decrece resultando en una reducción del tiempo de procesamiento en todo el proceso de descubrimiento de subestructuras. 3. Búsqueda orientada de patrones con overlap (traslape) selectivo El overlap limitado da a usuario la capacidad para definir el conjunto de elementos donde el overlap será permitido y que a su juicio considera relevantes para su contexto de trabajo (Ej. un objeto espacial, un atributo descriptivo). Estos elementos son representados usando vértices de acuerdo al modelo propuesto. En contra parte, el algoritmo descartará aquellos elementos traslapados que el usuario no consideró significativos. Por lo tanto, el overlap limitado proporciona al usuario un medio para implementar una búsqueda orientada de patrones con overlap. Es decir, el usuario delimita el conjunto de xxi

42 elementos que tendrán un papel preponderante en el proceso de descubrimiento de subestructuras. Adicionalmente, esta característica ofrece la ventaja de que el proceso de evaluación de patrones se simplifica dado que el conjunto de resultados generados es menor porque éstos se centran sobre los requerimientos del usuario. A.5 Resultados Como parte de las pruebas para evaluar nuestra propuesta de modelado y minado de datos espaciales usando un enfoque basado en grafos se desarrollaron tres casos de uso ilustrativos. Los dos primeros casos fueron implementados usando un censo de población del centro histórico de la ciudad de Puebla en el año de El tercer caso de uso fue desarrollado usando una base de datos espaciales de la región del volcán Popocatépetl. En esta sección presentamos ejemplos de resultados obtenidos con el caso de uso del volcán Popocatépetl. Supongamos que deseamos conocer características comunes entre las poblaciones (asentamientos poblacionales), carreteras y ríos en la zona que nos ayuden a evaluar/implementar planes de evacuación en caso de una contingencia volcánica. Por ejemplo, características de carreteras empezando en ó cruzando una población, material usado para construir esas carreteras y su estado actual (Ej. pavimentadas, terracería), características de carreteras y ríos que tengan alguna relación entre ellos (Ej. se crucen, se tocan), ríos cerca de poblaciones que en caso de alta precipitación pluvial pudieran representar peligros potenciales. xxii

43 Los experimentos fueron desarrollados usando los 5 modelos de representación actualmente propuestos. En esta sección se presentan patrones encontrados con el modelo #1 entre carreteras y ríos, carreteras y poblaciones, y por último ríos y poblaciones. La idea de organizar la presentación de resultados de esta manera es mostrar los diversos patrones que se pueden encontrar entre estos elementos. Al final de la sección se presentan tablas comparando los resultados obtenidos con cada uno de los 5 modelos. La Figura A.6 muestra el patrón más significativo descubierto entre carreteras y ríos usando el modelo #1. El patrón describe una relación entre carretera categoría terracería traslapando un río categoría escurrimiento en la zona. Este patrón puede considerarse como un indicador del número de carreteras que necesitan ser supervisadas en caso de una contingencia volcánica dado el tipo de material con el que están construidas, y porque ellos atraviesan ríos (la lectura puede ser hecha en orden inverso) que en caso de altas concentraciones pluviales pueden desbordarse e inutilizarlas. Subdue encontró con no overlap 46 instancias del patrón en la segunda iteración; vía overlap estándar encontró 85 instancias en la primera iteración; y a través de overlap limitado también encontró 85 instancias en la segunda iteración. Como podemos observar overlap estándar y overlap limitado encontraron el mismo número de instancias, pero overlap limitado necesitó dos iteraciones para encontrar el mismo patrón. Sin embargo, esto no quiere decir que overlap estándar es mejor que overlap limitado (respecto a tiempo de procesamiento) porque analizando el tiempo global de procesamiento requerido por overlap limitado para finalizar la fase de descubrimiento de subestructuras notamos que es menor que el requerido por overlap estándar. xxiii

44 a) b) c) Figura A.6. Relaciones entre carreteras y ríos usando el modelo #1. El patrón más significativo, usando el modelo #1, encontrado entre carreteras y poblaciones es presentado en la Figura A.7. Este describe una relación entre carretera categoría terracería tocando un asentamiento poblacional categoría construcción. Asentamiento poblacional categoría construcción representa en la capa de datos espaciales asentamientos de la base de datos del volcán áreas habitacionales con alta población, xxiv

45 edificios y una gran cantidad de construcciones usadas para ofrecer servicios a los habitantes. Si asumimos que la gente podría necesitar ser evacuada en caso de una erupción y que las carreteras que serían usadas para ese propósito son de terracería, entonces, esta situación podría convertirse en un problema (Ej. cuello de botella). En este experimento Subdue encontró vía no overlap 6 instancias del patrón en la novena iteración; a través de overlap estándar 9 instancias en la cuarta iteración; y usando overlap limitado 8 instancias en la décima iteración. a) b) c) Figura A.7. Relaciones entre carreteras y poblaciones usando el modelo #1. xxv

46 La Figura A.8 muestra el patrón más significativo encontrado entre ríos y poblaciones usando el modelo #1. El patrón describe una relación entre río categoría escurrimiento cruzando un asentamiento poblacional categoría manzana o construcción" en la zona. asentamientos poblacionales categoría manzana" representa en la capa de datos espaciales asentamientos de la base de datos del volcán áreas habitacionales con poca población, de hecho con muchas áreas deshabitadas, escasos edificios y construcciones. El patrón puede ser utilizado para identificar zonas potenciales de inundación, habitadas por personas, dada la cercanía de ríos. A través de no overlap Subdue encontró 5 instancias del patrón en la décima segunda iteración; usando overlap estándar encontró 5 instancia en la octava iteración; y vía overlap limitado también encontró 5 instancias en octava iteración. Subdue encontró el mismo patrón en los tres casos, sin embargo, usando overlap estándar y overlap limitado la lectura del patrón es más simple. xxvi

47 a) b) c) Figura A.8. Relaciones entre ríos y poblaciones usando el modelo #1. La Tabla A.2 presenta una comparación, por modelo, entre el número de instancias descubiertas/iteraciones necesitadas para descubrirlas y las tres implementaciones de overlap. Por ejemplo, usando el modelo #1, Subdue encontró 46 instancias (en la segunda iteración) de un patrón completo (nuestra definición para reportar un patrón completo es que éste contengan al menos dos objetos espaciales y la relación espacial entre ellos) xxvii

48 conteniendo los objetos espaciales carretera-río vía no overlap. Un valor más alto significa un modelo que permite descubrir más instancias de una subestructura (patrones). Recordemos que Subdue reporta como el mejor patrón (por iteración) la subestructura con el número más alto de instancias descubiertas de esa subestructura. Esta comparación es reportada por cada estructura objeto-objeto (Ej. carretera-río). Nota: NO (no overlap), SO (overlap estándar), LO (overlap limitado). Modelo #1 Modelo #2 Modelo #3 Modelo #4 Modelo #5 NO SO LO NO SO LO NO SO LO NO SO LO NO SO LO Carretera-Río Instancias Iteraciones Carretera-Población Instancias Iteraciones Río-Población Instancias Iteraciones Tabla A.2. Instancias/iteraciones por cada modelo basado en grafos: caso de uso Popocatépetl. La Tabla A.3 presenta una comparación de máximo/mínimo instancias descubiertas por cada implementación de overlap. Un modelo con el valor más alto es mejor porque permite descubrir más instancias de una subestructura. La comparación es presentada por cada estructura objeto-objeto. Por ejemplo en la estructura carretera-río el modelo #1 reportó 46 instancias descubiertas en la segunda iteración (el valor más alto). xxviii

49 Máximo Mínimo Carretera-Río No overlap modelo #1 (segunda iteración) modelos #3 y #4 (segunda iter.) Overlap estándar modelos #1, #4 y #5 (primera iter.) modelo #2 (tercera iteración) Overlap limitado modelo #1 (segunda iteración) modelo #3 (novena iteración) Carretera-Población No overlap modelo #5 (séptima iteración) modelo #3 (quinceava iteración) Overlap estándar modelo #1 (cuarta iteración) modelo #5 (patrón no completo) Overlap limitado modelo #4 (séptima iteración) modelo #3 (treceava iteración) Río-Población No overlap modelo #4 (sexta iteración) modelo #2 (dieciseisava iteración). Overlap estándar modelo #3 (cuarta iteración) modelo #1 (octava iteración). Overlap limitado modelo #1 (octava iteración) modelo #2 (catorceava iteración) Tabla A.3. Max/Min de instancias descubiertas por objeto-objeto /característica overlap. La Tabla A.4 presenta una comparación entre el promedio de instancias descubiertas por modelo. Un valor más alto significa un modelo permitiendo descubrir más instancias de una subestructura. Cada valor representa el promedio de subestructuras descubiertas usando no overlap, overlap estándar y overlap limitado. La comparación es reportada por cada estructura objeto-objeto. Modelo #1 Modelo #2 Modelo #3 Modelo #4 Modelo #5 Carretera-Río Carretera-Población Río-Población Tabla A.4. Promedio de instancias descubiertas por modelo/ objeto-objeto. xxix

50 La Tabla A.5 presenta una comparación entre el promedio de instancias descubiertas por modelo. Un valor más alto significa un modelo permitiendo descubrir más instancias de una subestructura. La comparación es reportada por cada implementación de overlap. Modelo #1 Modelo #2 Modelo #3 Modelo #4 Modelo #5 NO SO LO NO SO LO NO SO LO NO SO LO NO SO LO Tabla A.5. Promedio de instancias descubiertas por modelo/característica overlap. La Tabla A.6 presenta un comparativo final entre el promedio de instancias descubiertas por modelo. Podemos ver en la tabla que el modelo #1 reporta el valor más alto de instancias descubiertas (de acuerdo a nuestros parámetros para reportar instancias completas) en este caso de uso ilustrativo. Los siguientes modelos son el modelo #2 y el modelo #4 respectivamente. Modelo #1 Modelo #2 Modelo #3 Modelo #4 Modelo # Tabla A.6. Promedio de instancias descubiertas por modelo. A.6 Conclusiones La constante interacción entre los seres humanos y su hábitat natural, el planeta de la tierra, genera día a día, nuevos requerimientos asociados al manejo y explotación de datos espaciales. Por ejemplo, el análisis urbano, la prevención los riesgos naturales, la exploración del espacio estelar, la contaminación en los océanos, y la reforestación de los xxx

51 suelos, por nombrar algunos de ellos. La minería de datos espaciales involucra la integración de métodos y técnicas provenientes de diversos campos científicos los cuales nos ayudan, por medio de algoritmos de análisis y de descubrimiento, a producir una enumeración particular de patrones sobre los datos espaciales. Nuestra argumentación en esta disertación doctoral se basa en la idea de que los enfoque de minaría de datos espaciales descritos no consideran todos los elementos encontrados en una base de datos espaciales (datos espaciales, datos no espaciales y relaciones espaciales entre los objetos espaciales) de una manera integral. En consecuencia, se propuso emplear un enfoque basado en grafos para representar estos elementos como un solo conjunto de datos, minarlos como un todo, para de estar forma poder encontrar patrones conteniendo ambos tipos de datos y relaciones espaciales (patrones más descriptivos). En nuestro modelo las relaciones espaciales entre los objetos espaciales son incluidas porque una característica significativa de los datos espaciales es la influencia que los vecinos de un objeto pueden tener en el objeto mismo. En el modelo incluimos tres tipos de relaciones espaciales. Derivado del modelo general se propusieron cinco modelos operativos. Tres aspectos definen las características de un grafo creado con estos modelos: (1) Representación de las relaciones espaciales equivalentes. (2) Representación de relaciones espaciales simétricas. (3) La manera de representar los objetos y sus relaciones. Como parte integrante de nuestra metodología para el minado de datos espaciales usando un enfoque basado en grafos, usamos el sistema Subdue como nuestra herramienta de minado. Se propuso un nuevo algoritmo llamado overlap limitado el cual le da al usuario la xxxi

52 capacidad de especificar el conjunto de vértices sobre los cuales el overlap es permitido. Visualizamos tres motivaciones para proponer este nuevo enfoque: (1) Reducción del espacio de búsqueda. (2) Reducción del tiempo de procesamiento. (3) Búsqueda orientada de patrones con overlap (traslape) selectivo. Para demostrar la viabilidad, capacidad de minado y descubrimiento de patrones usando el enfoque propuesto, se desarrolló un prototipo implementado nuestro modelo para crear los conjuntos de datos basados en grafos, minar esos grafos (a través del llamado al sistema Subdue) y visualización de patrones encontrados. Los resultados generados de los casos de uso desarrollados nos dan un panorama respecto a cómo y qué podríamos obtener usando este enfoque. Es importante comentar el hecho de que podemos utilizar esta metodología de modelado y representación en cualquier dominio que pueda ser representado como grafo. Una vez demostrada la viabilidad de nuestra propuesta, las perspectivas relacionadas a mejorar nuestro trabajo (modelo de representación basado en grafos, algoritmo de minado de datos y sistema prototipo) incluyen los puntos siguientes: Visualización de conocimiento descubierto. Por ejemplo, visualización de resultados sobre las capas espaciales, a través del uso de gráficas, y navegación en la jerarquía de patrones descubiertos usando un enfoque de hypergrafo. Mejoramiento de los algoritmos empleados para crear los conjuntos de datos basados en grafos de acuerdo a los modelos propuestos. La validación de relaciones espaciales entre objeto espaciales es una fase que en la mayoría de los casos requiere gran cantidad de recursos de computacionales. xxxii

53 Minado de los grafos. Se empleó el sistema Subdue como herramienta de minado de datos. Así mismo, se propuso un nuevo algoritmo llamado overlap limitado. El isomorfismo de grafos es un problema NP-completo, de tal forma, debemos ser capaces de que nuestros tiempos de procesamiento para la búsqueda de patrones cumpla parámetros aceptables de eficiencia. Manejo y representación de relaciones entre datos no espaciales. Las relaciones implícitas y explicitas entre los atributos describiendo los objetos espaciales pueden ser incluidos en el modelo con la finalidad de mejorar la representación de los datos. A.7 Contribución La contribución al descubrimiento de conocimiento en el dominio de los datos espaciales, descrito en esta disertación, es el desarrollo de un nuevo enfoque para el modelado y minado de datos espaciales usando una representación basada en grafos. Este enfoque incluye los aspectos siguientes: Se propuso una nueva representación de datos basada en grafos para datos espaciales. Se visualizaron dos objetivos para crear un modelo de datos con estas características. El primero de ellos es crear un único conjunto de datos, basado en grafos, representando estos elementos relacionados. El segundo es emplear este conjunto de datos para alimentar a un sistema de minado de datos basado en grafos, de tal forma que pudiéramos descubrir patrones conteniendo datos espaciales, no espaciales y relaciones espaciales los cuales nos ayuden a describir/entender los xxxiii

54 datos, basado en la premisa, de que estos son elementos relacionados en el mundo real. Se propuso un nuevo algoritmo para descubrir subestructuras (patrones) usando un enfoque de overlap limitado en el sistema Subdue, nuestra herramienta de minado de datos. Visualizamos directamente tres motivaciones para proponer la implementación del nuevo algoritmo: reducción del espacio de búsqueda, reducción del tiempo de procesamiento y búsqueda orientada de patrones con overlap selectivo (specialized overlapping pattern oriented search). Se diseñó e implementó un sistema prototipo implementado el modelo propuesto. El prototipo ofrece una interfase de usuario amigable para el manejo de las capas espaciales con las que se trabajará, para la creación de grafos espaciales y no espaciales, para el minado de estos grafos (a través del llamado al sistema Subdue) y para el despliegue de los resultados generados. xxxiv

55 Appendix B RÉSUMÉ EN FRANÇAIS B.1. Introduction Durant ces dernières années nous avons été témoins de la rapide croissance du nombre, de la capacité et de la dissémination des applications informatiques consacrées à l'obtention, la génération, la manipulation, et le stockage de données dans divers milieux de la vie humaine. Ces actions ont impliqué le recueil d'une grande quantité de données dont l'analyse par des moyens manuels devient chaque jour plus compliquée. Rappelons que dans plusieurs occasions les données crues ont besoin d'être analysées et interprétées pour qu'elles soient transformées en informations utiles et profitables. Cette situation a été résolue grâce à des techniques et des outils informatiques qui, de plus en plus nombreux, nous aident pour atteindre ces objectifs. La découverte de connaissances dans bases de données (KDD, sigle en langue anglaise) est définie comme l'extraction non banale d'informations implicites, préalablement inconnues et potentiellement utiles à partir de données [16]. Il s'agit d'un processus itératif et interactif qui comprend différentes phases. Le noyau du processus est la phase du data mining (fouille de données), qui peut être conceptualisée comme l'application d'algorithmes d'analyse de données et de fouille de xxxv

56 données qui grâce à des paramètres adéquats produisent et découvrent une énumération particulière de patrons (régularités) sur des données elles-mêmes [12]. Dans ce contexte, l'analyse et l'exploitation de données provenant de phénomènes existant en, sur, et sous la surface de la terre appelés données spatiales, ont engendré un nouveau domaine d'investigation appelé fouille de données spatiales, de telle sorte que la fouille de données spatiales s'envisage comme la découverte de connaissances implicites, et préalablement inconnues de données spatiales [16]. Diverses approches pour la fouille de données spatiales ont été développées, dont ne seront examinées que les plus représentatives. B.1.1 Méthodes basées sur la généralisation La généralisation a démontré être une des méthodes effectives pour découvrir des connaissances. Elle a été introduite par la communauté d'apprentissage automatique et se base sur l'apprentissage à partir d'exemples. La découverte de connaissances basée sur la généralisation requiert la construction de hiérarchies de concepts (donnés explicitement par l'expert ou générés automatiquement). Dans le cas des bases de données spatiales, on peut trouver deux types de hiérarchies de concepts: (1) Hiérarchies thématiques, par exemple, généraliser des tomates et des bananes comme des fruits, les fruits et légumes comme des aliments d'origine végétale, etc. (2) Hiérarchies spatiales, par exemple, généraliser une série de points spatiaux comme une région ou pays. Lu et al. [35] étendent une technique d'induction basée sur les attributs, aux bases de données spatiales. Cette technique construit la hiérarchie de généralisation en résumant les xxxvi

57 relations entre les données spatiales et non spatiales à un niveau de concept plus élevé. Les auteurs présentent deux algorithmes basés sur la généralisation: (1) Approche de domination de données non spatiales. Cette méthode réalise en première instance une induction basée sur les attributs aux données non-spatiales et après cette étape, aux données spatiales. (2) Approche de domination de données spatiales ; étant donné la hiérarchie de données spatiales, la généralisation est effectuée d`abord sur ces données et après sur les données non spatiales. B.1.2 Regroupement Le regroupement (clustering) est le processus permettant de regrouper de manière physique ou abstraite des objets en classes d'objets semblables. Cette approche de fouille de données nous aide à construire des groupes représentatifs d'un ensemble d'objets basé sur une mesure de similitude (Ex. distance euclidienne). En d'autres termes, le regroupement de données identifie des groupes (clusters) ou des régions densément peuplés en accord avec une certaine mesure de distance dans un ensemble de données multidimensionnelles. On peut classifier les algorithmes de regroupement en quatre groupes principaux : algorithmes de regroupement basés sur les approches k-means (centre de gravité du cluster) et k-medoid (objet représentatif du cluster), algorithmes hiérarchiques, algorithmes basés sur la localisation des objets (regroupement par densité), et dernièrement ceux qui sont basés sur des grids. xxxvii

58 B.1.3 Associations spatiales Une règle d'association spatiale est une règle qui décrit l'implication d'un ensemble d'objets vers un autre ensemble d'objets d'une base de données spatiales [29]. Un exemple d'une règle d'association spatiale peut être si la entreprise est près de Mexico City est alors une grande entreprise. Une règle d'association spatiale est de la forme X Y, où X et Y sont des ensembles de prédicats spatiaux ou non spatiaux. Il existe plusieurs types de prédicats spatiaux qui pourraient constituer une règle d'association spatiale, par exemple, des relations topologiques comme intersection, recouvrement et des relations d'orientation spatiale comme Gauche_de, ou Ouest_de. B.1.4 Rapprochement et agrégation Les méthodes basées sur le rapprochement et l'agrégation cherchent à analyser les caractéristiques des groupes d'objets (clusters) pour former des groupes d'objets (features) proches entre eux. La proximité devient la mesure permettant de regrouper un ensemble de points dans un cluster en un feature. L'idée de trouver des relations de proximité n'est pas un problème simple comme on pourrait le croire, car il existe trois raisons pour cette affirmation. Supposons qu'on ait un cluster de points et que l'on veuille trouver les k- features les plus proches de lui : La taille et forme du cluster et les features peuvent être très variés., On pourrait avoir une grande quantité de features à examiner, Même pour trouver une forme connue (ex. polygone) qui décrit la forme du cluster, il serait impropre de reporter les features où les limites soient plus proches aux xxxviii

59 limites de celui-ci, parce que la distribution des points à l'intérieur du cluster ne peut pas être uniforme. B.1.5 Fouille de données-images L'extraction de patrons à partir d'images est autre approche de la fouille de données spatiales. Dans la littérature il existe divers travaux de fouille de données développés selon cette approche. Par exemple, Fayyad et al. [14] présentent un système pour l'identification et la catégorisation des volcans sur la surface de Venus à partir d'images transmises par la sonde stellaire Magellan. Dans un autre travail [15] (Second Palomar Observatory Sky Survey) les auteurs ont utilisé des arbres de décision pour la classification des galaxies, des étoiles et des autres objets stellaires. Stolorz et al. [44] d'une part, et Shek et al. [41] d'autre part ont effectué des fouilles de données spatio-temporelles de données géophysiques. B.1.6 Classification de données spatiales La classification de données spatiales a comme objectif de trouver des règles qui divisent un ensemble d'objets en un nombre de groupes dans lesquels les objets de chaque groupe appartiennent à une même classe. Divers types d'information peuvent être utilisés pour caractériser les objets spatiaux, comme par exemple, les attributs non spatiaux d'un objet, les prédicats spatiaux et les fonctions spatiales. L'idée est d'utiliser cette information pour en extraire les attributs pour l'étiquetage de classes (attributs qui divisent les données en classes) et ceux dont les valeurs sont utilisés dans un arbre de décision pour créer de nouvelles branches. xxxix

60 B.1.7 Détection de tendances spatiales La détection de tendances spatiales décrit les changements réguliers d'un ou plus attributs non spatiaux d'un objet quand celui-ci se déplace d'un point de référence initiale. Un exemple de tendance spatiale serait si on s'éloigne du centre historique de la ville de Puebla le prix des terrains décroit. Les trajectoires de mouvement à partir d'un point x sont utilisées pour modéliser ce mouvement, et l'analyse de régression sur les attributs des objets peut être utilisée pour décrire les patrons de changements. Il existe deux types de tendances : globales et locales. B.2. Motivation Bien que les approches précitées effectuent les fouilles de données spatiales avec succès, notre perception est que celles-ci ne considèrent pas tous les éléments trouvés dans une base de données spatiales (données spatiales, données non spatiales et relations spatiales entre les objets spatiales) d'une manière exhaustive. C'est-à-dire que certaines d'entre elles d'abord réalisent la fouille de données spatiales et ensuite celles non spatiales ou vice-versa, et d'autres autorisent des combinaisons de ces éléments mais de manière restreinte. Au vu des considérations précédentes, on propose l'argument suivant : si nous sommes capables de représenter les données spatiales, non spatiales et les relations entre objets spatiales comme un unique ensemble de données, et on les fouille ainsi, on pourrait générer ou trouver des patrons de connaissances qui caractérisent notre ensemble de données contenant ces trois éléments de manière conjointe. Pour un tel objectif, on émet l'hypothèse qu'une xl

61 représentation basée sur graphes est suffisamment flexible et puissante pour représenter ces éléments de manière conjointe, facilement compréhensible et capable de créer différentes représentations du même domaine. Le domaine est décrit en utilisant des graphes, ces graphes se transformant en données d'entrée pour un outil de découverte de connaissances basé sur les graphes lequel utilise des heuristiques pour sélectionner des sous-graphes qui sont considérés comme importants (patrons). Un graphe est défini comme une paire G = (V,E). V = {v 1,,v n } dénote un ensemble fini d'éléments appelés sommets. E est un ensemble d'arcs e satisfaisant E [V] 2. Donc, chaque arc e E est une paire (v i,v j ). Si (v i,v j ) est une paire ordonnée pour n'importe quel (v i,v j ) E, on dit que G = (V,E) est un graphe orienté. Un graphe étiqueté possède des étiquettes associées à leurs sommets et aussi aux arcs. Après avoir proposé la représentation de relations spatiales entre les objets spatiaux, dans la suivante section on détaillera les trois types de relations spatiales que l'on utilisera. B.2.1 Relations spatiales La position explicite des objets spatiaux définit des relations implicites de voisinage (neighborhood) spatial entre eux. Ainsi, l'information sur le voisinage des objets spatiaux constitue un élément de valeur qui doit être considéré comme le travail de fouille de données spatiales. Martin Ester et al. [9][11] introduisent le concept de graphes de voisinage pour représenter explicitement ces relations de voisinage implicite. Les graphes de voisinage peuvent couvrir les relations de voisinage suivantes : xli

62 Topologiques : dérivées du modèle à neuf intersections [6][7][8], ce sont des relations que restent invariables sous des transformations linéaires, c'est-à-dire que si les deux objets simultanément tournent, se déplacent ou changent d'échelle les relations entre eux sont conservées. Distance : la distance entre deux objets compare à un seuil grâce à des opérateurs arithmétiques tels que <, >, =. La distance entre deux objets se définit comme la distance minimum entre eux (Ex. sélectionner tous les éléments à l'intérieur d'un rayon de 50 kilomètres d'un point x). Direction : la relation spatiale R de direction entre deux objets spatiaux A et B (B R A) est définie en utilisant un point représentatif de l'objet A et tous les points de l'objet but B. Le point représentatif de l'objet source A peut être le centre de l'objet ou un point sur ses limites. Ce point représentatif est utilisé comme l'origine d'un système de coordonnées virtuelles et son quadrant définit la direction. Une fois décrite la motivation de notre travail de recherche, nous présentons à la suite le modèle général basée sur des graphes pour représenter les données spatiales. B.3 Représentations basées sur des graphes Comme dit auparavant, notre proposition s'appuie sur la création d'un modèle basé sur les graphes pour représenter conjointement données spatiales, non spatiales, relations spatiales entre les objets spatiaux. L'idée est d'utiliser les graphes générés comme données d'entrée pour un algorithme de fouille de données, de telle sorte que l'algorithme puisse trouver des xlii

63 patrons concernant ces éléments de manière conjointe et non comme des éléments séparés. En conséquence, on propose le modèle de représentation donné en notation UML Figure B.1. Topological Distance Direction 1..* Spatial relation -From: -To: Edge Vertex Spatial object 1..* Non-spatial attribute Figure B.1. Modèle basé sur des graphes pour représenter des données spatiales. Dans ce modèle, les données spatiales (ex. objets spatiaux), données non spatiales (ex. attributs descriptifs), et relations spatiales sont représentées comme une collection d'un ou plusieurs graphes orientés étiquetés. Les sommets peuvent représenter des objets spatiaux, types de relation spatiale entre deux objets (relation binaire), ou des attributs non spatiaux décrivant les objets spatiaux. Les arcs représentent un lien existant entre deux sommets de n'importe quel type. Dépendant du type de sommets qu'un arc unit, celui-ci peut représenter le nom d'un attribut descriptif ou le nom d'une relation spatiale. Les noms des attributs peuvent se rapporter à des descriptions d'objets spatiaux et/ou à entités non spatiales. On utilise des arcs orientés pour représenter les informations directionnelles, de relations entre deux éléments (Ex. objet x couvre objet y) et pour décrire des attributs appartenant à un objet (Ex. objet x a attribut z). xliii

64 Actuellement il existe cinq représentations à partir du modèle général décrit auparavant. Trois aspects définissent les caractéristiques des graphes créés dans chaque modèle: (1) Représentation des relations spatiales équivalentes (Ex. toucher, recouvrir). (2) Représentation des relations spatiales symétriques (Ex. contient-de/dans_de, Nord_de/Sud_de). (3) La manière de représenter les objets et leurs relations dans le modèle. En conséquence, les graphes créés se différencient de manière quantitative et qualitative. Dans la partie quantitative il y a des différences telles que : nombre de sommets et d'arcs employés pour représenter les données spatiales, non spatiales et les relations, création de graphes simples, l'usage des arcs orientés et non orientés pour représenter des relations spatiales et/ou des attributs descriptifs. Dans la partie qualitative on a observé, à travers des expériences, que certains modèles ont une plus grande expressivité pour représenter l'ensemble de données, conditionnant de manière directe la qualité des résultats générés dans le processus de fouille. Ensuite on présente trois des cinq modèles proposés décrivant les métriques d'évaluation créées pour caractériser à chacun d'eux. Modèles Dans le but de décrire les caractéristiques de chaque modèle, on va utiliser à titre d'exemple l'ensemble de données de la Figure B.2. Comme on peut l'observer, notre ensemble de données se compose de deux objets spatiaux, l'objet A représentant une maison et l'objet B représentant un lac, et les trois relations spatiales suivantes: (1) Distance, objet A près de objet B (relation équivalente). (2) Topologique, objet A touche objet B (relation équivalente). (3) Direction, objet A Sud de objet B (relation symétrique). xliv

65 North Object B West East Object A South Figure B.2. Base d'exemple pour caractériser les 3 modèles proposés. Modèle n 1 - modèle de base La Figure B.3 montre le premier modèle créé pour représenter des données spatiales dans l'approche proposée. Les caractéristiques du modèle selon les métriques créées pour sa caractérisation sont : Nom. Sommets : 2 sommets, chacun représentant un objet spatial (objet A et objet B). Nom. Arcs : 4 arcs, 3 arcs pour représenter les relations spatiales originales existantes dans notre ensemble de données d'exemple ( près, touche et Sud_de ) et un arc pour représenter la relation Nord_de, créé à partir de la relation symétrique original Sud_de. Taille (sommets + arcs) : 6 % incrément : 0%, celui-ci est le modèle base. Graphe simple. Non. Car c'est un graphe complexe avec 4 arcs unissant 2 sommets. Arc orienté. Oui, car on utilise les relations symétriques Sud_de et Nord_de. La direction des arcs se fait avec la lecture des relations entre les objets. xlv

66 Arc non orienté. Oui, car on utilise les relations équivalentes près et touche. Information complète. Oui, car dans le graphe est représentée la relation symétrique Nord_de créée à partir de la relation symétrique original Sud_de. Arc Relation redondant. Non, car dans le modèle on n'utilise pas les arcs Relation. South_of Spatial object A close touch Spatial object B North_of Figure B.3. Modèle n 1 - modèle base. Modèle n 2 - réplication simple de types de relation, information complète Dans la Figure B.4 on représente le deuxième modèle créé pour représenter les données spatiales. Les caractéristiques du modèle selon les métriques sont les suivantes : Nom. Sommets : 5 sommets, 2 sommets pour représenter les objets spatiales et trois sommets pour représenter les types de relations spatiales topologique, distance et direction. Pour chaque type de relation spéciale existant entre deux objets, on ajoute un sommet étiqueté avec le nom du type de la relation spatiale. Dans l'exemple il existe une relation topologique, une relation de distance et une relation de direction. Nom. Arcs : 6 arcs, 3 arcs pour représenter les relations originales, 2 arcs pour représenter les relations équivalentes ( près et touche ) créées à partir des xlvi

67 relations originales et un arc pour représenter la relation symétrique ( Nord_de ) créée aussi à partir des relations originales. Taille (sommets + arcs) : 11 % incrément : % Graphe simple. Oui, car il existe un arc supplémentaire entre n'importe quel couple de sommets donnés. Arc orienté. Oui, car on utilise toutes les relations. La direction des arcs va des sommets représentant les objets spatiaux aux sommets représentant les types de relations spatiales. Arc non orienté. Non, car dans le modèle on n'utilise pas d'arcs non orientés. Information complète. Oui, car on représente les relations symétriques créées à partir des relations originales. Arc Relation redondant. Non, dans le modèle on n'utilise pas d'arcs Relation. Distance close close Spatial object A touch Topological touch Spatial object B South_of North_of Direction Figure B.4. Modèle n 2 - réplication simple des types de relation, information complète. xlvii

68 Modèle n 3 - double réplication des types de relation, information non complète Dans la Figure B.5 on présente le troisième modèle créé pour représenter des données spatiales utilisant une approche de graphes. Les caractéristiques en accord avec les métriques sont: Nom. Sommets : 8 sommets, 2 sommets pour représenter les objets spatiaux et 6 sommets pour représenter les types de relations spatiales ( distance, topologique et direction ). Pour chaque type de relation spatiale entre deux objets spatiaux, on ajoute deux sommets étiquetés avec le nom du type de la relation spatiale. Par exemple, dans nos données d'essai il existe trois types de relations : 1 relation topologique, 1 relation distance et 1 relation direction ; ainsi on ajoute 6 sommets, 2 pour chaque type de relation spatiale. Nom. Arcs : 9 arcs, 6 arcs Relation pour unir les sommets représentant les objets spatiaux avec les sommets représentant les types de relations spatiales (de chaque sommets représentant un objet spatial naissent 3 arcs puisqu'il existe 3 types de relation), et 3 arcs pour présenter les relations spatiales originales. Ces 3 arcs sont utilisés pour unir les sommets représentant les types de relations spatiales. Taille (sommets + arcs) : 17 % incrément : % Graphe simple. Oui, car il existe au moins un arc entre n'importe quel couple de sommets données. Arc orienté. Oui, car on utilise les relations symétriques et les arcs Relation. La direction des arcs Relation va des sommets représentant les objets spatiaux aux xlviii

69 sommets représentant les types de relations spatiales. La direction des arcs restant se fait avec la lecture des relations spatiales entre les objets. Arc non orienté. Oui, car on les utilise pour représenter les relations équivalentes. Information complète. Non, car dans le modèle ne sont pas représentées les relations symétriques qui sont créées à partir des relations spatiales originales. Arc Relation redondance Oui, car dans le modèle on utilise des arcs Relation pour représenter explicitement l'existence et type d'une relation spatiale entre 2 objets spatiaux. De plus, on emploie ces arcs pour éviter la création de graphes complexes. Spatial object A Spatial object B Relation Relation Relation Distance close Distance Relation Relation Relation Topological touch Topological Direction South_of Direction Figure B.5. Modèle n 3 - double réplication des types de relation, information non complète. La Table B.1 présente les résultats des neuf métriques développées pour évaluer les caractéristiques de chaque modèle proposé (actuellement 5 modèles). Le modèle n 1 est appelé le modèle base. xlix

70 Modèle Nom. Nom. Dimensions % Graphe Arc Arc non Information Arc Sommet Arcs (s + a) Incrément Simple Orienté Orienté Complète Relation Redondant (1) (2) (3) (4) (5) (6) (7) (8) (9) n Non Oui Oui Oui Non n Oui Oui Non Oui Non n Oui Oui Oui Non Oui n Oui Oui Oui Non Oui n Oui Oui Non Oui Oui Table B.1. Caractéristiques des modèles de représentation basés sur des graphes. Les métriques ont été proposées sur la base de causes à effets de chacune aussi bien dans le graphe créé que dans l'algorithme de fouille. Nous apercevons quatre caractéristiques significatives liées directement à ces métriques. 1. Espace de recherche L'espace de recherche dans un algorithme de fouille de données basé sur des graphes consiste en la liste des sous-graphes qui peuvent être dérivés du graphe initial, de telle manière que les nombres de sommets (1) et arcs (2) du graphe créé (3) définissent la taille de l'espace de recherche pour le système de découverte. Donc, l'objectif doit être de minimiser le nombre de sommets et d'arcs utilisés pour créer les graphes mais en même temps maximiser la représentativité des ceux-ci. Comme on peut le voir dans la Table B.1, le modèle utilisant le nombre minimum de sommets et d'arcs pour représenter l'ensemble de données d'exemple c'est le modèle n 1 (2 sommets et 4 arcs) alors que le modèle n 5 est le cas opposé (8 sommets et 12 arcs). l

71 2. Temps de traitement La taille de l'espace de recherche joue un rôle important touchant au temps de traitement utilisé pour découvrir des patrons. Si on dispose d'un grand espace de recherche, il faudra plus de temps pour évaluer tous les sous-graphes possibles. Donc, une comparaison de la métrique "pourcentage d'incrémentation" (4) entre les modèles proposés figure dans la Table B.1. Rappelons que cette métrique compare la taille d'un modèle donné par rapport au modèle n 1 (modèle base). Par exemple, le modèle n 5 augmente la taille du graphe de % par rapport au modèle n 1. C'est-à-dire que l'algorithme de fouille aura besoin d'évaluer % plus de sommets et/ou d'arcs en utilisant le modèle n 5 au lieu du modèle n 1 pour le même ensemble de données. 3. Complexité du graphe Dans le chapitre 4 du mémoire de thèse, on décrit le système Subdue, notre outil de fouille de données basées sur des graphes, outil provenant de l'université du Texas à Arlington. Comme dit précédemment, il existe une plus grande complexité pour l'algorithme de fouille travaillant avec des graphes complexes au lieu de graphes simples (Ex. au maximum un arc unissant n'importe quel couple de sommets donné). Par exemple, de plus, il faut tenir compte dans le traitement de comparaison des graphes, de la phase d'extension (Subdue emploie une approche d'"extension" pour découvrir des patrons), et de l'étape de compréhension du graphe. Donc, l'objectif a été de proposer des modèles basés sur les graphes qui nous permettent de créer des graphes plus simples. Comme on peut le voir dans la Table B.1, seulement le modèle n 1 ne permet pas de créer des graphes simples. li

72 En conséquence, comme stratégie pour réduire la multiplicité du nombre d'arcs entre deux sommets, on emploie les approches suivantes: Ajouter un nouveau sommet étiqueté avec le nom du type de la relation (Ex. topologique, distance et direction) pour chaque relation spatiale entre les objets spatiaux. Cette approche est utilisée dans le modèle n 2. Ajouter un nouveau sommet (modèle n 4) ou deux nouveaux sommets (modèle n 3 et modèle n 5) étiquetés comme le nom du type de la relation spatiale (Ex. topologique, distance et direction) par chaque relation spatiale entre les objets spatiaux et fusionner ce nouveau sommet (modèle n 4) ou de nouveaux sommets (modèle n 3 et modèle n 5) avec les sommets représentant les objets spatiaux parmi des arcs étiquetés comme "Relation". L'approche employée pour fusionner les sommets diffère pour chaque modèle selon la définition de chacun d'eux. Cette nomenclature est utilisée pour représenter le fait qu'il existe une relation spatiale entre les objets spatiaux. Ces arcs sont connus comme arcs Relation redondants (9). 4. Représentativité des données Les métriques arcs orientés (6), arcs non orientés (7), et information complète (8) sont utilisées pour maximiser la représentativité des données mais minimisant, aussi tant que possible, la taille du graphe et sa complexité. Les arcs orientés sont utilisés pour représenter les relations spatiales symétriques (objet A Nord_of objet B, implique, B Sud_of A), les arcs Relation redondantes, les attributs non spatiaux qui décrivent les objets spatiaux. Les arcs non orientés sont utilisés pour représenter les relations spatiales équivalentes (la lii

73 relation est représentée par un arc non orienté au lieu de deux arcs orientés). Finalement, l'information complète signifie que les relations spatiales symétriques entre objets spatiaux aussi sont représentées dans le modèle. B.4 Fouille du graphe La caractéristique de recouvrement (partage de sommets appartenant à différentes instances d'une sous-structure) accomplit un rôle important dans le système de découverte de patrons (sous-structures) dans notre outil de fouille de données basés sur les graphes à savoir le système Subdue. En conséquence, les résultats générés sont conditionnés par le fonctionnement de cette caractéristique. Pourtant, son implémentation actuelle est régulière : permettre le recouvrement entre toutes les instances d'une sous-structure (sans aucune règle) ou ne pas autoriser le recouvrement entre aucune instance appartenant à une sousstructure. C'est-à-dire, tout ou rien. Dans ce contexte, on propose une nouvelle approche appelée recouvrement limité. Un des avantages principaux de cette nouvelle approche est la capacité d'autoriser l'usager à spécifier l'ensemble de sommets où le recouvrement sera permis. Ces sommets pourraient représenter les éléments significatifs dans le contexte de travail. On donnera directement trois motivations pour proposer un nouvel algorithme, lesquelles seront expliquées dans les sous-sections suivantes: 1. Réduction de l'espace de recherche Dans les systèmes de découverte de connaissances basés sur des graphes, l'algorithme de fouille de données utilise des graphes comme représentation de connaissance. L'espace de liii

74 recherche d'un tel algorithme consiste en tous les sous-graphes qui peuvent être dérivés du graphe initial. Le processus de découverte de sous-structures en Subdue commence par la création de sous-structures d'un seul sommet à partir du graphe d'entrée (une sous-structure par chaque étiquette de sommet qui existe au moins 2 fois dans le graphe). Dans chaque itération du processus de découverte, l'algorithme sélectionnera les meilleures sousstructures et étend les instances de ces sous-structures en ajoutant un arc voisin (ou un arc et un nouveau sommet) dans toutes les directions possibles. Mais comme partie du processus de sélection des meilleurs substructures et donc d'extension, il existe aussi un processus de filtrage. Dans ce processus, en accord avec la valeur du paramètre de recouvrement, les instances d'une sous-structure sont évaluées : si le recouvrement est permis, les instances partageant des sommets sont maintenues ; au contraire si le recouvrement n'est pas autorisé les instances qui partagent les sommets sont écartés. La meilleure sous-structure découverte par Subdue (en accord à les métriques d'évaluation), dans chaque itération, peut être employée pour comprimer le graphe initial qui peut donc se transformer en un nouveau graphe d'entrée pour une interaction suivante. Après plusieurs itérations, Subdue crée une description hiérarchique des données, où les substructures découvertes lors de certaines itérations peuvent être définies sur la base de sous-structures découvertes lors des itérations préalables. Ainsi, le nombre d instances d'une substructure définit l'espace de recherche (dans chaque itération) dans le processus de découverte de sous-structures. Comme on peut l'observer à travers de l'utilisation du recouvrement limité, on obtient une réduction de l'espace de recherche puisque le nombre liv

75 d'instances candidates à être étendues conditionne les valeurs (sommets) où le recouvrement est permis. 2. Réduction du temps de traitement La réduction du nombre d'instances candidates à être étendues apporte une réduction de l'espace de recherche, et cela contribue à une réduction du temps de traitement pour la recherche de sous-structures (patrons). Permettre le recouvrement en Subdue, implique une évolution du temps de calcul puisqu'augmente le nombre d'instances candidates à être étendues, évaluées, comparées, et découvertes. Pourtant, avec l'implémentation du recouvrement limité, le nombre d'instances à être traitées dans ces phases décroît contribuant à une réduction du temps de traitement dans tout le calcul de découverte de sous-structures. 3. Recherche orientée de patrons avec recouvrement (partager) sélectif Le recouvrement limité donne à l usager la capacité de définir des ensembles d éléments où le recouvrement sera permis et qu'à son avis il considère comme pertinents pour son contexte de travail (Ex. un objet spatial, un attribut descriptif). Ces éléments sont représentés en utilisant des sommets en accord avec le modèle proposé. En contre partie, l'algorithme écartera les éléments que l'utilisateur n'a pas considérés comme significatifs. Par conséquent, le recouvrement limité fournit à l'utilisateur un moyen pour mettre en œuvre une recherche orientée vers des patrons avec recouvrement. C'est-à-dire, l'utilisateur délimite l'ensemble d'éléments qui auront un rôle prépondérant dans le processus de lv

76 découverte de sous-structures. De plus, cette caractéristique offre l'avantage que le processus d'évaluation de patrons est simplifié puisque l'ensemble de résultats produits est plus petit parce que ceux-ci se centrent sur les demandes de l'utilisateur. B.5 Résultats A partir des essais pour évaluer notre proposition de modélisation et de fouille de données spatiales en utilisant un modèle basé sur les graphes, on a développé trois cas d'utilisation exemplaires. Les deux premiers cas ont été mis en œuvre en utilisant un recensement de population du centre historique de la ville de Puebla durant l'année de Le troisième cas d'utilisation a été développé en utilisant une base de données spatiales de la région du volcan Popocatépetl. Dans cette section nous présentons des exemples de résultats obtenus avec le cas du volcan Popocatépetl. Supposons que nous souhaitons connaître des caractéristiques communes entre les habitats, routes et rivières dans la zone qui nous aident à évaluer ou mettre en œuvre des plans d'évacuation en cas d'éventualité volcanique : par exemple, caractéristiques de routes commençant dans ou croisant un habitat, un matériel utilisé pour construire ces routes et leur état actuel (ex. pavées, non pavées), caractéristiques des routes et des rivières qui ont une certaine relation entre elles (ex. croisement), rivières près d'habitats qui en cas de haute précipitation pluviale pourraient présenter des dangers potentiels. lvi

77 Les expériences ont été développées en utilisant les 5 modèles de représentation actuellement proposés. Dans cette section on présente des patrons trouvés avec le modèle n 1 entre routes et rivières, routes et habitats, et finalement rivières et habitats. L'idée d'organiser la présentation des résultats de cette manière est de montrer les divers patrons qui peuvent être découverts entre ces éléments. À la fin de la section on présente des tableaux comparant les résultats obtenus avec chacun des 5 modèles. La Figure B.6 montre le patron le plus significatif découvert entre des routes et des rivières en utilisant le modèle n 1. Le patron décrit une relation entre "route de catégorie non pavée qui traverse une rivière catégorie écoulement" dans la zone. Ce patron peut être considéré comme un indicateur du nombre de routes qui ont besoin d'être vérifiées en cas d'éventualité volcanique vu le type de matériel avec lequel elles sont construites, et parce qu'elles traversent des rivières (la lecture peut être faite en sens inverse) qui en cas de hautes concentrations pluviales peuvent déborder et les mettre hors d'usage. Subdue a trouvé avec non recouvrement 46 instances du patron dans la seconde itération ; par l'intermédiaire du recouvrement standard il a trouvé 85 instances dans la première itération ; et à travers le recouvrement limité il a aussi trouvé 85 instances dans la seconde itération. Comme nous pouvons observer les recouvrements standard et recouvrement limité ont trouvé le même nombre d'instances, mais le recouvrement limité a eu besoin de deux interactions pour trouver le même patron. Toutefois, ceci ne veut pas dire que le recouvrement standard est meilleur que le recouvrement limité (en ce qui concerne temps de traitement) parce qu'en analysant le temps global de traitement requis pour le recouvrement limité pour achever la phase de découverte de sous-structures nous remarquons qu'il est plus petit que celui requis par le recouvrement standard. lvii

78 a) b) c) Figure B.6. Relations entre des routes et des rivières en utilisant le modèle n 1. Le patron le plus significatif, utilisant le modèle n 1, trouvé entre des routes et des habitats est présenté dans la Figure B.7. Il décrit une relation entre "route catégorie non pavée en touchant un habitat catégorie construction". "habitat catégorie construction" représente dans la couche de données spatiales "habitat" de la base de données du volcan, habitats avec grosse population, bâtiments et une grande quantité de constructions utilisées pour offrir lviii

79 des services aux habitants. Si nous faisons l'hypothèse que les gens pourraient avoir besoin d'être évacués en cas d'éruption et que les routes utilisées pour ce but sont non pavées, alors, cette situation pourrait se transformer en un problème (Ex. embouteillage). Dans cette expérience Subdue a trouvé par l'intermédiaire du non recouvrement 6 instances du patron dans la neuvième itération ; par le biais du recouvrement standard 9 instances dans la quatrième itération ; et en utilisant recouvrement limité 8 instances dans la dixième itération. a) b) c) Figure B.7. Relations entre des routes et des villes en utilisant le modèle n 1. lix

80 La Figure B.8 montre le patron le plus significatif trouvé entre des rivières et des populations en utilisant le modèle n 1. Le patron décrit une relation entre "rivière catégorie écoulement qui traverse un habitat catégorie "îlot" ou "bâtiment" dans la zone. "habitat catégorie îlot" représente dans la couche de données spatiales "habitat" de la base de données du volcan, des villages avec peu de population, en fait avec beaucoup de secteurs dépeuplés, bâtiments et constructions précaires. Le patron peut être utilisé pour identifier des zones potentielles d'inondation, habitées par des personnes isolées habitant aux alentours de rivières. À travers le non recouvrement, Subdue a trouvé 5 instances du patron dans la dixième seconde itération ; en utilisant le recouvrement standard a trouvé 5 instance dans la huitième itération ; et par l'intermédiaire du recouvrement limité il a aussi trouvé 5 instances en huitième itération. Subdue a trouvé, toutefois, le même patron dans les trois cas mais en utilisant le recouvrement standard et le recouvrement limité. Le Tableau B.2 présente une comparaison, par modèle, entre le nombre d'instances découvertes/itérations nécessaires pour les découvrir et les trois mises en œuvre du recouvrement. Par exemple, en utilisant le modèle n 1, Subdue a trouvé 46 instances (dans la seconde itération) d'un patron "complet" (notre définition pour décrire un patron "complet" est que celui-ci contient au moins deux objets spatiaux et la relation spatiale entre eux) en contenant les objets spatiales route-rivière par l'intermédiaire du non recouvrement. Une valeur plus élevée signifie un modèle qui permet de découvrir plus d'instances d'une sous-structure (patrons). Rappelons que Subdue décrit comme le meilleur patron (par itération) la sous-structure avec le nombre plus élevé d'instances découvertes de lx

81 cette sous-structure. Cette comparaison est effectuée par chaque structure "objet-objet" (Ex. route-rivière). a) b) c) Figure B.8. Relations entre des rivières et des villes en utilisant le modèle n 1. lxi

82 Annotation: NO (non recouvrement), SO (recouvrement standard), LO (recouvrement limité). Modèle n 1 Modèle n 2 Modèle n 3 Modèle n 4 Modèle n 5 NO SO LO NO SO LO NO SO LO NO SO LO NO SO LO Route-Rivière Instances Itération Route-Ville Instances Itération Rivière-Ville Instances Itération Table B.2. Instances/itérations par chaque modèle basé sur des graphes : cas d'utilisation Popocatépetl. Le Tableau B.3 présente une comparaison de maximum/minimum instances découvertes par chaque mise en œuvre du recouvrement. Un modèle avec la valeur plus élevée est meilleur parce qu'il permet de découvrir davantage d'instances d'une sous-structure. La comparaison est présentée par chaque structure "objet-objet". Par exemple dans la structure route-rivière le modèle n 1 a trouvé 46 instances découvertes dans la seconde itération (la valeur plus élevée). lxii

83 Maximum Minimum Route-Rivière Non recouvrement modèle n 1 (deuxième itération) modèles n 3 et n 4 (deuxième itér.) Recouvrement standard modèles n 1, n 4 et n 5 (première itér.) modèle n 2 (troisième itération) Recouvrement limité modèle n 1 (deuxième itération) modèle n 3 (neuvième itération) Route-Ville Non recouvrement modèle n 5 (septième itération) modèle n 3 (quinzième itération) Recouvrement standard modèle n 1 (quatrième itération) modèle n 5 (modèle non complet) Recouvrement limité modèle n 4 (septième itération) modèle n 3 (treizième itération) Rivière-Ville Non recouvrement modèle n 4 (sixième itération) modèle n 2 (seizième itération). Recouvrement standard modèle n 3 (quatrième itération) modèle n 1 (huitième itération). Recouvrement limité modèle n 1 (huitième itération) modèle n 2 (quatorzième itération) Table B.3. Max/Min d'instances découvertes par "objet-objet"/caractéristique recouvrement. Le Tableau B.4 présente une comparaison entre la moyenne d'instances découvertes par modèle. Une valeur plus élevée signifie un modèle permettant de découvrir plus d'instances d'une sous-structure. Chaque valeur représente la moyenne de sous-structures découvertes en utilisant le non recouvrement, le recouvrement standard et le recouvrement limité. La comparaison est donnée par chaque structure "objet-objet". Modèle n 1 Modèle n 2 Modèle n 3 Modèle n 4 Modèle n 5 Route-Rivière Route-Ville Rivière-Ville Table B.4. Moyenne d'instances découvertes par modèle/"objet-objet". lxiii

84 Le Tableau B.5 présente une comparaison entre la moyenne d'instances découvertes par modèle. Une valeur élevée signifie un modèle permettant de découvrir plus d'instances d'une sous-structure. La comparaison est donnée par chaque mise en œuvre du recouvrement. Modèle n 1 Modèle n 2 Modèle n 3 Modèle n 4 Modèle n 5 NO SO LO NO SO LO NO SO LO NO SO LO NO SO LO Table B.5. Moyenne d'instances découvertes par modèle/caractéristique recouvrement. Le Tableau B.6 présente une fin comparative entre la moyenne d'instances découvertes par modèle. Nous pouvons voir dans le tableau que le modèle n 1 donne la valeur plus élevée d'instances découvertes (en accord avec nos paramètres des instances complètes) dans ce cas d'utilisation exemplaire. Les modèles suivants sont le modèle n 2 et le modèle n 4 respectivement. Modèle n 1 Modèle n 2 Modèle n 3 Modèle n 4 Modèle n Table B.6. Moyenne d'instances découvertes par modèle. B.6 Conclusions L'interaction constante entre les êtres humains et leur habitat naturel, la planète terre, produit jour après jour, de nouvelles demandes associées à la manipulation et l'exploitation lxiv

85 de données spatiales. Par exemple, l'analyse urbaine, la prévention les risques naturels, l'exploration de l'espace stellaire, la pollution des océans, et le déboisement des sols, pour nommer certains d'entre eux. La fouille de données spatiales intègre l'intégration de méthodes et techniques provenant de divers domaines scientifiques lesquels nous aident, au moyen d'algorithmes d'analyse et de découverte, à produire une énumération particulière de patrons sur les données spatiales. Notre argumentation dans ce mémoire de thèse se base sur l'idée que la fouille de données spatiales ne considère pas tous les éléments trouvés dans une base de données spatiales (données spatiales, données non spatiales et relations spatiales entre les objets spatiales) d'une manière exhaustive. Par conséquent, on a proposé d'employer une analyse basée sur les graphes pour représenter ces éléments comme un unique ensemble de données, les fouiller comme un tout, de façon à pouvoir découvrir des patrons contenant les deux types données et relations spatiales (patrons les plus descriptifs). Dans notre modèle les relations spatiales entre les objets spatiaux sont incluses parce qu'une caractéristique significative des données spatiales est l'influence que les voisins d'un objet peuvent avoir avec l'objet lui-même. Dans notre modèle nous incluons trois types de relations spatiales. A partir du modèle général on a proposé cinq modèles opérationnels. Trois aspects définissent les caractéristiques d'un graphe créé avec ces modèles : (1) Représentation des relations spatiales équivalentes. (2) Représentation de relations spatiales symétriques. (3) La manière de représenter les objets et leurs relations. Comme partie intégrante de notre méthodologie pour la fouille de données spatiales en utilisant une analyse basée sur les graphes, nous utilisons le système Subdue comme outil de fouille. lxv

86 Nous avons proposé un nouvel algorithme appelé recouvrement limité lequel donne à l'utilisateur la capacité de spécifier l'ensemble de sommets sur lesquels le recouvrement est permis. Nous donnons trois motivations pour proposer cette nouvelle analyse: (1) Réduction de l'espace de recherche. (2) Réduction du temps de processus. (3) Recherche orientée de patrons avec recouvrement (partager) sélectif. Pour démontrer la viabilité, capacité de fouille et de découverte de patrons en utilisant l'analyse proposée, on a développé un prototype pour la mise en œuvre de notre modèle en créant les ensembles de données basées sur les graphes, pour fouiller ces graphes (par l'entremise du système Subdue) et pour visualiser les patrons découverts. Les résultats produits des cas d'utilisation développés nous donnent un vaste panorama de ce que nous pourrions obtenir en utilisant cette analyse. Il est important de généraliser le fait que nous pouvons utiliser cette méthodologie de modélisation et de représentation dans tout domaine qui peut être représenté pour un graphe. Les perspectives d'amélioration de notre travail (modèle basé en graphes, algorithme de fouille de données et système prototype) incluent les points suivants : Visualisation de connaissances découvertes. Par exemple, visualisation de résultats sur les couches spatiales, à travers l'utilisation d'icones, et la navigation dans la hiérarchie de patrons découverts. Amélioration des algorithmes employés pour créer les ensembles de données basées sur des graphes en accord avec les modèles proposés. La validation des lxvi

87 relations spatiales entre objets spatiaux est une phase qui dans la majorité des cas requiert une grande quantité de ressources en calcul. Fouille de graphes. On a employé le système Subdue comme outil de fouille de données. De plus, on a proposé un nouvel algorithme appelé recouvrement limité. L'isomorphisme de graphes est un problème NP-complet et par conséquent, les algorithmes devront être capables de réduire les temps de traitement pour la recherche de patrons. Manipulation et représentation de relations entre des données non spatiales. Les relations implicites et explicites entre les attributs en décrivant les objets spatiaux peuvent être inclus dans le modèle afin d'améliorer la représentation des données. B.7 Contribution La contribution à la découverte de connaissances dans le domaine des données spatiales, décrite dans ce mémoire, est le fruit d'une nouvelle façon de modéliser et de fouiller les données spatiales en utilisant une représentation basée sur des graphes. Cette approche inclut les aspects suivants : Nous avons proposé une nouvelle représentation de données basée sur les graphes pour traiter les données spatiales. On a atteint deux objectifs pour créer un modèle de données avec ces caractéristiques. Le premier d'entre eux est de créer un seul ensemble de données, basé sur des graphes, en représentant ces éléments en rapport les uns avec les autres. Le deuxième est d'employer cet ensemble de données pour lxvii

88 alimenter un système de fouille de données basé sur des graphes, de telle sorte que puissions-nous découvrir des patrons contenant des données spatiales, non spatiales et des relations spatiales lesquelles nous aident à décrire et comprendre les données, tout ceci basé la prémisse qu'il s'agit d'éléments en rapport dans le monde réel. Nous avons proposé un nouvel algorithme pour découvrir des sous-structures (patrons) en utilisant une analyse de recouvrement limité dans le système Subdue. Nous donnons directement trois motivations pour proposer la mise en œuvre du nouvel algorithme : réduction de l'espace recherche, réduction du temps de processus et recherche orientée de patrons avec recouvrement sélectif (specialized overlapping pattern oriented search). On a conçu et on a mis en œuvre un prototype pour le modèle proposé. Le prototype offre une interface d'utilisateur convivial pour la manipulation des couches spatiales avec lesquelles on travaillera, pour la création de graphes spatiaux et non spatiaux, pour la fouille de ces graphes (à travers du système Subdue) et pour le déploiement des résultats produits. lxviii

89 Chapter 1 INTRODUCTION Due to the advances in data generation and recompilation we are facing a continuous growth in data collections. The analysis and interpretation of this data by manual techniques is sometimes a tough task; therefore, different methods have been proposed to help us transform it into useful knowledge. Knowledge discovery in databases (KDD) is defined as the non-trivial extraction of implicit, previously unknown, and potentially useful information from data [16]. This is an interactive and iterative process that involves several phases. The data mining phase is the nucleus of the process; it consists of the application of data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produces a particular enumeration of patterns over the data [12]. Data mining involves the integration of methods from different scientific fields like machine learning, database technology and statistics. The first approaches developed focused on the discovery of knowledge from relational data. Nowadays, however, terms like geoprocessing and Geographic Information System are used widely in daily life. This is a result of the improvement in the human capabilities to create, manipulate, store and use data from phenomena on, above or below the earth s 1

90 surface. This data is known as spatial data. Spatial data mining focuses on the discovery of implicit and previously unknown knowledge in spatial databases [16]. Spatial data has many features that distinguish it from relational data such as the relationships among the objects, complexity, and the query language used to access them. 1.1 Motivation As a result of the growth in the volume of spatial datasets and the necessity for tools to help us transform them into useful information, several approaches have been developed for knowledge discovery from spatial data: the principle in the Generalization-based methods [21, 35] is that data and objects often contain detailed information at primitive concept levels but sometimes it is desirable to summarize that information and present it at a higher concept level. Clustering [26, 28, 37, 40, 46] is the process of grouping physical or abstract objects into classes (clusters) of similar objects so that the members of a cluster are as similar as possible whereas members of different clusters differ as much as possible from each other. Spatial associations [13, 29] discover rules that associate one or more spatial objects with other spatial objects. In Approximation and aggregation methods [27] the idea is to analyze the characteristics of the clusters in terms of the features (objects) close to them. Aggregate proximity is the measure of closeness of the set of points in the cluster to a feature. Mining in image and raster databases [13, 14] can be viewed as another approach of spatial data mining. Some applications of this approach (based on images) are automatic recognition and categorization of astronomical objects; classification of stars, galaxies and other stellar objects. Spatial Classification [31] is the task of assigning objects to a set of 2

91 classes based on their attribute values. Spatial Trend Detection [10] describes the regular change of one or more non-spatial attributes of an object when moving away from a given starting object. However, we argue that these approaches do not consider all the elements found in a spatial database (spatial data, non-spatial data and spatial relations among the spatial objects) in an extended way. Some of them focus first on spatial data and then on non-spatial data or vice versa, and others consider restricted combinations of these elements. We think that it is possible to enhance the generated results of the data mining task by mining them as a whole and not as separated elements (in the real world they are related). In this context, we propose to use a graph-based representation since it provides the flexibility to describe these elements together and this is the motivation to explore the area of graph-based spatial knowledge discovery. Our work is based on the hypothesis that if we create a graph-based model to represent together spatial and non-spatial data and if we use this model for generating a dataset composed of both type of data, then we can apply data mining techniques using this knowledge representation to spatial and non-spatial data at the same time and get descriptive patterns considering both kind of data about objects and the spatial relations among them. 3

92 1.2 Proposal Our proposal is to create a unique graph-based model to represent spatial data, non-spatial data and the spatial relations among spatial objects. We will generate datasets composed of graphs with a set of these three elements. We consider that by mining a dataset with these characteristics a graph-based mining tool can search patterns involving all these elements at the same time improving the results of the spatial analysis task. A significant characteristic of spatial data is that the attributes of the neighbors of an object may have an influence on the object itself, therefore, to enhance the data representativeness we propose to include in the model three relationship types (topological, orientation, and distance relations). Moreover, from a general point of view spatial database systems are relational databases plus a concept of spatial location and spatial extension. So, most KDD algorithms for spatial databases must make use of those neighborhood relationships because it is the main difference between KDD in relational database system and spatial database system. In the model the spatial data (i.e., spatial objects), non-spatial data (i.e., non-spatial attributes), and spatial relations are represented as a collection of one or more directed graphs. A directed graph contains a collection of vertices and edges representing all these elements. Vertices represent either spatial objects, spatial relation types between two spatial objects (binary relation), or non-spatial attributes describing the spatial objects. Edges represent a link between two vertices of any type. According to the type of vertices that an edge joins, it can represent either an attribute name or a spatial relation name. The attribute name can refer to a spatial object or a non-spatial entity. We use directed edges to represent 4

93 directional information of relations among elements (i.e., object x covers object y) and to describe attributes about objects (i.e., object x has attribute z). We propose to adopt the Subdue system [24, 45], a general graph-based data mining system developed at the University of Texas at Arlington, as our mining tool. Subdue discovers substructures using a graph-based representation of structural databases. The substructures (a connected subgraph within the graphical representation) describe structural concepts in the data (i.e., patterns). The discovery algorithm follows a computationally constrained beam search. The algorithm begins with the substructure matching a single vertex in the graph. On each iteration, the algorithm selects the best substructure and incrementally expands the instances of the substructure. An instance of a substructure in the input graph is a subgraph that matches (graph theoretically) that substructure. A special feature named overlap has a primary role in the substructures discovery process and consequently a direct impact over the generated results. However, it is currently implemented in an orthodox way: all or nothing. If we set overlap to true, Subdue will allow the overlap among all instances sharing at least one vertex. On the other hand, if overlap is set to false, Subdue will not allow the overlap among instances sharing at least one vertex. We argue that a third option is needed: a limited overlap. With this option we give the user the capability to set over which vertices, the overlap will be allowed (vertices representing remarkable elements that refer, for instance, to a spatial object in a spatial database or to some characteristic defining a particular topic of a dataset). We visualize directly three motivations issues to propose the implementation of the new algorithm: search space reduction, processing time reduction, and pattern oriented search. 5

94 1.3 Contribution The contribution to the discovery knowledge in the spatial data domain, described in this dissertation, is the development of a new approach for spatial data modeling and mining using a graph-based representation. This approach includes the following issues: We proposed a new graph-based data representation for spatial data, non-spatial data and spatial relations among the spatial objects. We visualize two objectives for creating a data model with these characteristics. The first one is to create a unique graph-based dataset representing these related elements. The second one is to use this dataset to feed a graph-based mining system, so we can discover single patterns (involving these elements) that will help us to describe/understand the data, based on the premise that they are related elements in the real world. We proposed a new algorithm to discover substructures (patterns) using a limited overlap approach in the Subdue system. We visualize directly three motivations issues to propose the implementation of the new algorithm: search space reduction, processing time reduction, and specialized overlapping pattern oriented search. We designed and developed a prototype system implementing the proposed model. The prototype provides to the user a friendly graphical user interface for managing the spatial layers to work with, for creating spatial and non-spatial graphs, for mining those graphs (by calling the Subdue system) and for displaying the generated results. 6

95 1.4 Organization of the thesis The thesis is structured in the following way: Chapter 1 presents the motivation, proposal, and contributions of the thesis. Related work is described in Chapter 2. In Chapter 3 we detail our graph-based model to represent together spatial data, non-spatial data, and spatial relations. Our graph-based data mining tool, the Subdue system, and the new limited overlap algorithm are described in Chapter 4. A prototype system implementing our model is presented in Chapter 5. Use-cases showing the applicability of our proposal are described in Chapter 6. Finally, conclusions and final remarks are commented in Chapter 7. 7

96 Chapter 2 RELATED WORK Knowledge Discovery in Databases, spatial data, non-spatial data, Data Mining, Spatial Data Mining, Geographic Information System, geoprocessing, and Geomatics are terms broadly used nowadays. This chapter presents an outline of these issues. The Geographic Information Systems are presented in Section 2.1. In Section 2.2 we describe related work about Knowledge Discovery in Databases and data mining. Finally, Section 2.3 describes the three types of spatial relations we propose to incorporate in our model to represent spatial data. 2.1 Geographic Information System (GIS) A Geographic Information System is defined as a tool for the manipulation of geographic data [2]. The GIS performs a great diversity of functions; some of them are the compilation, verification, storage, retrieval, manipulation, update and visualization of geographic data. Additionally, one of its more important features is the inclusion of data analysis modules. 8

97 All these functions are applied by a GIS to geographic data, stored generally in a geographic database. The data processed and manipulated is georeferenced, that is, it is assigned to a specific location on the Earth s surface using a coordinate system. The GIS can process data from several sources. For example, data collected from maps, images and photography, statistical data from mathematical analyses, and data from CAD systems (Computer-Assisted Design). A GIS organizes and handles digital data stored generally in a geographic database. The databases are important in the GIS technology because they store geographic data with a structured form, allowing the data to be used for many tasks. Many GIS implement additional functionalities when they use database management systems (DBMS) to store and to handle all or some data in an independent subsystem. The diversity of the uses of the GIS has generated the proliferation of a great variety of definitions of GIS. A user generally defines a GIS according to what he uses it for and his own experience and abilities. Some of these definitions are: A system for data processing designed for the production and/or visualization of maps. An information system to respond to questions about Earth s properties or soil types. A system for decision making support in situations of natural phenomena. An electronic positioning system to be used by terrestrial or marine transportation. 9

98 The GIS frequently is named according to its application field [2]. For example, when they are used to manage land s registries they are generally called Land Information Systems (LIS); in applications of municipal and natural resources they are important components of the Urban Information Systems (UIS), and Natural Resources Information Systems (NRIS) respectively. The term Automatic Mapping/Facility Management (AM/FM) is used by public maintenance companies, transportation agencies, and local governments for systems dedicated to the operation and maintenance of networks. The GIS s field (see Figure 2.1) involves many disciplines, applications, data types, and end users, for example: Disciplines: Computer Science, Cartography, Spatial Analysis, Topography, Hydrography, Statistic, Information Sciences, Planning, etc. Applications: Operation and maintenance of networks and other devices, administration of natural resources, highway planning, map production, urban analysis, planning analysis, etc. Data: Digital maps, digital images and photography, data satellites, video images, etc. Users: Planners, topographers, vulcanologists, geographers, environmentalist, engineers, etc. 10

99 Users Disciplines Data Applications Figure 2.1. Geographic Information System. Geographic Information Systems involve the uses of systems and science. Their use arise questions such as: How does a GIS user know that the results obtained are accurate? How can user interfaces be made readily understandable by novice users? M. F. Goodchild published in 1992 a paper where he argued that questions such as these and their systematic study constituted a science. Geoprocessing study the fundamental issues arising from geographic information (i.e., creation, handling, storage and use of the information). The term Geomatics, the fusion of ideas from geosciences and informatics, is defined as the umbrella covering all fields that are today important for understanding and further developing information systems in this context [32, 33, 34]. 11

100 A GIS offers its users greater capabilities to process datasets than those offered by manual systems. In a GIS database the data is stored in a structured form, unlike the manual systems where the data is stored in files, maps, and/or reports. The data can be recovered from geographic databases and processed faster and more safely than in manual systems. We can classify the GIS s users into two groups. In the first group, we find professional operators. They are people trained in some particular software and they know the capabilities of this technology. Many times these people do not use the results of their work, but they pass them on to the end users. The second user group spends less time working with the GIS s. They maintain geographic information in order to have tools which help them in the decision making process. They have few opportunities for extensive training in the GIS tools, and consequently the GIS must be simple and easy to handle. 2.2 Data Mining Knowledge Discovery in Databases (KDD) is defined as the non-trivial extraction of implicit, previously unknown, and potentially useful information from data [16]. This is an interactive and iterative process that involves several phases: data preparation, search for patterns over the data, evaluation and interpretation of discovered patterns, and refinement of the whole process as it is shown in Figure

101 Data Clean data Integrated data Selected data Cleanning Integration Selection Transformation Patterns evaluation Data mining Knowledge Patterns Formatted and structured data Figure 2.2. Knowledge Discovery in Databases. The data mining phase (the search for patterns) is the nucleus of the process; it consists of the application of data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produces a particular enumeration of patterns over the data [4, 12]. Data mining involves the integration of methods from different scientific fields such as machine learning, database technology, statistics, and visualization as shown in Figure 2.3. The first approaches developed focused on the discovery of knowledge from relational data. 13

102 Machine Learning Database Technology Statistics Data Mining Information Science Other Disciplines Visualization Figure 2.3. Data mining: integration of several fields. Several architectures have been proposed for data mining [22, 25]. Figure 2.4 presents an architecture based on the proposal of Han et al. [23]. The user is the trigger of the entire process and receptor of the discovered knowledge. Data may be fetched from several sources such as files, databases, and data warehouses using a data server module. The data mining engine may use one or more data mining techniques for searching patterns from data. The significance, importance and interestingness of the found patterns are evaluated by the pattern evaluation module. The data mining engine and pattern evaluation modules may use background knowledge stored in a knowledge database. The role of the graphical user interface is to receive the user requirements and to deliver the generated results. Since this is an iterative and interactive process, the components may interact among themselves. 14

103 U S E R S Graphical User Interface Pattern Evaluation Data Mining Engine Knowledge Database Data Server Data Data Data Database Data Warehouse Figure 2.4. Architecture for a KDD system Spatial Data Mining Climate change, natural risk prevention, human demography, deforestation, and natural resources atlas are examples from a large variety of issues arisen as a result of the interaction among people and their natural environment, the Planet Earth. Data generated from those issues are known as spatial data. Spatial data mining methods focuses on the discovery of implicit and previously unknown knowledge in spatial databases [16]. Spatial data have many features that distinguish them from relational data. For example, the spatial objects may have topological, distance, and direction information, the complexity, 15

104 and the query language used to access them. Different approaches have been developed for knowledge discovery from spatial data such as Generalization-based methods [21, 35], Clustering [26, 28, 37, 40, 46], Spatial associations [13, 29], Approximation and aggregation [27], Mining in image and raster databases [13, 14], Classification learning [31], and Spatial trend detection [10]. In the following subsections we present a brief description of these spatial data mining approaches Generalization-based Method Generalization has been shown to be one of the effective methods of discovering knowledge. It was introduced by the machine learning community and it is based on learning from examples. Generalization-based knowledge discovery requires concept hierarchies (given explicitly by experts or generated automatically). In the case of the spatial databases, there can be two types of concept hierarchies: Thematic hierarchies. We can generalize tomatoes and bananas as fruits, fruits and vegetables as cash crops. Spatial hierarchies. We can generalize some geographic points as a country or a region. The approach introduced by the machine learning community (tuple-oriented) cannot be directly adopted for large spatial databases because it does not handle very well the noise and inconsistent data, and the algorithms are exponential in the number of examples. Han et al. [21] present a modified technique named attribute-oriented induction for mining 16

105 relational data. Lu et al. [35] extended the attribute-oriented induction technique to spatial databases. Attributed-oriented induction is performed by climbing the generalization hierarchies and summarizing the general relationships between spatial and non-spatial data at a higher concept level. The authors present two generalization based algorithms: Non-spatial data dominant. This method performs attribute-oriented induction on the non-spatial attributes, first generalizing them to a higher concept level and later merges corresponding spatial attributes (using spatial merge and approximation). Spatial data dominant. Given the spatial data hierarchy, generalization can be performed first on the spatial data and then generalizing their corresponding nonspatial attributes. Both algorithms assume that the rules to be mined are general data characteristics (characteristics rules) and that the discovery process is initiated by the user who provides a learning request. A disadvantage in this approach is the case where a hierarchy may not exist or the hierarchy given by the experts may not be entirely appropriate in some cases. The quality of mined characteristics is highly dependent on the structure of the hierarchy Clustering Clustering is the process of grouping physical or abstract objects into classes of similar objects. Clustering analysis helps construct meaningful partitioning of large set of objects. It identifies clusters, or densely populated regions, according to some distance measurement in a multidimensional dataset. Given a set of multidimensional data points, the data space is usually not uniformly occupied by the data points. Data clustering 17

106 identifies the sparse and the crowded places, and hence discovers the overall distribution patterns of the dataset. We can classify clustering algorithms in four main approaches: Partitioning, Hierarchical, Locality-based and Grid-based algorithms. Partitioning algorithms partition a database of n objects into a set of k clusters which are represented by the gravity of the cluster (k-means algorithms) or by one representative object of the cluster (k-medoid algorithms). These algorithms use a two-step procedure. First, they determine k representatives, next, assign each object to the cluster with its representative closest to the considered object. Hierarchical clustering algorithms decompose the database into several levels of partitioning which are usually represented by a dendrogram. The algorithm iteratively splits the database into smaller subsets until some termination condition is satisfied. The dendrogram can either be created top-down (divisive) or bottom-up (agglomerative). Locality-based clustering algorithms group neighboring data elements into clusters based on local conditions and therefore allow the clustering to be performed in one scan of the database. Grid-based algorithms quantize the space into a finite number of cells and then do all operations on the quantized space. They frequently use hierarchical agglomeration as one of their processing phases. Classification of clustering algorithms is neither straightforward, nor canonical. Some algorithms perform clustering by combining techniques from these approaches. Important issues in clustering algorithms include the following properties: 18

107 The algorithm must be efficient (time complexity). Ability to handle noise (outliers). Non-sensibility to data input order. A Priori knowledge and parameters tuning. Ability to find clusters of arbitrary shape. Scalability to large databases. Ability to work with high dimensional data Spatial Associations A spatial association rule is a rule which describes the implication of one or a set of features by another set of features in spatial databases [29]. An example of a spatial association rule is if the company is close to México City then is a big company. A spatial association rule is of the form X Y, where X and Y are sets of spatial or nonspatial predicates. There are various kinds of spatial predicates that could constitute a spatial association rule. Some examples are topological relations such as intersects, overlaps, disjoint; and spatial orientations such as left_of, west_of. Koperski and Han [29] developed an algorithm for spatial associations rules in spatial databases. They use the concepts of minimum support and minimum confidence introduced by Agrawal et al. [1] to develop association rules from large transactional databases. The support of a pattern A in a set of spatial objects S is the probability that a member of S satisfies pattern A, and the confidence of A B is the probability that a pattern B occurs if 19

108 pattern A occurs. A user or an expert may assign thresholds to confine the rules to be discovered to be strong ones. Although many spatial association rules may exist in large databases, some of them may occur rarely or may not hold in most cases. In addition, such rules are usually not 100% accurate and may contain non trivial knowledge. The authors employ a method which uses a top-down progressive deepening search technique. The technique firstly searches at a high concept level for large patterns and strong implications relationships among the large patterns at a coarse resolution scale. Then only for those large patterns, it deepens the search to lower concept levels. Such deepening search process continues until no large patterns can be found. The search employed for large patterns at high concept levels is applied at a coarse resolution scale efficiently by using approximate spatial computation algorithms such as R-trees [20] or plane-sweep [39] techniques operating on minimum bounding rectangles (MBR). Only the candidate spatial predicates, which are detailed reviewed, will be computed by refined spatial techniques. Such multiple-level approach saves much computation because it is very expensive to perform detailed spatial computation for all possible spatial association relationships Approximation and Aggregation Clustering algorithms are effective and efficient methods for answering questions such as: where are the clusters in the spatial database? In some cases it is also important to answer the question why the clusters are there. We can rephrase the question as: what are the 20

109 characteristics of the clusters in terms of the features (objects) that are close to it? Knorr and Ng [27] present a study based on this question. The aggregate proximity is the measure of closeness of the set of points in the cluster to a feature as opposed to the distance between a cluster boundary and the boundary of a feature. Finding aggregate proximity relationships is not as simple as it may seem. There are three reasons: The sizes and shapes of the cluster and the features may vary greatly. There may be a very large number of features to examine. Even if a suitable feature (i.e., polygon) is found to describe the shape of the cluster of points, it is inappropriate to simply report those features whose boundaries are closest to the cluster s boundary, because the distribution of points in a cluster may not be uniform. The authors propose the use of computational geometry concepts to find out the characteristics of a given cluster in terms of the features close to it. They present the algorithm CRH (where C is for encompassing circle 1, R for isothetic rectangle 2, and H for convex hull 3 ) which uses concepts as filters to reduce the candidate features at multiple levels. In general, they collect a large number of features from multiples sources (i.e., 1 A circle that encloses a set of n points. Not necessarily minimum bounding. 2 A rectangle that is orthogonal to the coordinate axes. 3 This is the unique, minimum bounding convex shape enclosing a set of points. 21

110 maps) and feed them along with the cluster to the algorithm CRH and discover knowledge about spatial relationships. Approximation by circles and then by rectangles is used to eliminate features that have large aggregate distance to the cluster. After these filters, the algorithm calculates the aggregate proximity of points in the cluster to the convex boundary of each feature that passed through the previous filters. In the last step, the algorithm reports the features with the best aggregate proximities showing the minimum and maximum distances of points in the cluster to the feature, average distance, and percentages of points located in the distance less than specified threshold Mining an Image Database Knowledge mining from image databases can be viewed as a special case of spatial data mining. For example, Fayyad et al. present a system [14] to identify and categorize volcanoes on the surface of Venus from images taken by the Magellan spacecraft. Three basic components are implemented in the system: data focusing, feature extraction, and classification learning. The first component increases the overall efficiency of the system by first identifying the portion of the image being analyzed that is most likely to contain a volcano. They compare the intensity of the central pixel of a region to the estimated mean background intensity of its neighborhood pixels. The second component extracts interesting features from the data. The final component uses training examples provided by the experts to create a classifier that can discriminate between volcanoes and false alarms. For this task the authors implement decision trees. 22

111 Other studies of mining in image and raster databases are: Second Palomar Observatory Sky Survey [15], it uses decision trees for the classification of galaxies, stars and other stellar objects. Stolorz and Dean [43] proposed a system for detecting earthquakes from space. They combined methods of statistical interference, massively parallel computing, and global optimization to build the system that analyze tectonic activities with sub-pixel resolution over a large area. Stolorz et al. [44] and Shek et al. [41] carry out studies about fast spatio-temporal data mining from geophysical datasets; they described CONQUEST, a distributed parallel querying and analysis mining tool Classification Learning Spatial classification has as objective to find rules that divide a set of objects into a number of groups, where objects in each group belong mostly to one class. Many types of information can be used to characterize spatial objects. We can classify such information into non-spatial attributes of objects, spatially related attributes with non-spatial values, spatial predicates and spatial functions. Each of these categories may be used to extract both for class label attributes (attributes that divide data into classes) and predicting attributes (attributes on whose values the decision tree is branched). It is possible to use aggregate values for some of these attributes. Koperski et al. [31] proposed and evaluated a method for classification of spatial objects. The method enables classification of spatial objects based on aggregate values of non- 23

112 spatial attributes for neighboring regions. Spatial relations between objects on the map are taken into a count, which may be represented into the form of predicates Spatial Trend Detection Spatial trend detection is defined as a regular change of one or more non-spatial attributes in the neighborhood of some object in a database [10]. An example of spatial trend is as moving away from downtown Puebla, the price of land decreases. Neighborhood paths starting from some point x are used to model the movement and a regression analysis is performed on the respective attribute values for the objects of a neighborhood path to describe regularity of change. In the regression analysis the distance from x is the independent variable and the difference of the attribute values are the dependent variable(s). There are two types of trends: global trends and local trends. In the first one, the existence of a global trend for a start object x indicates that if considering all objects on all paths starting from x the values for the specified attribute(s) in general tend to increase or decrease with increasing distance. Local trends exist only in a particular direction. 2.3 Spatial relations The explicit location and extension of objects define implicit relations of spatial neighborhood. Therefore, the information about the neighborhood of spatial objects constitutes a valuable element that must be considered in the mining task. In the following 24

113 subsection we will present the Neighborhood Graphs, Neighborhood Paths and Neighborhood indices concepts that introduce us to the three types of spatial relations we propose to include in our model to represent together spatial data Neighborhood Graphs, Neighborhood Paths and Neighborhood Indices Martin Ester et al. [9, 11] introduce the concept of neighborhood graphs for explicitly representing those implicit neighborhood relations relevant for the KDD tasks. He claimed that attributes of the neighbors of some object of interest may have an influence on the object and therefore have to be considered as well. He commented that the efficiency of many KDD algorithms for spatial database systems depends heavily on an efficient processing of these neighborhood relationships. Neighborhood graphs may cover one of the following neighborhood relations: Topological. Distance. Direction. These relations are called binary relations since we can determine spatial relations between pairs of objects. A neighborhood graph G for some spatial relation neighbor is a graph where nodes are objects in the database and edges between nodes n 1 and n 2 represent the fact that the relationship neighbor (n 1, n 2 ) holds. The neighborhood path in the neighborhood graph G is 25

114 a set of nodes directly connected through the edges of the graph. The neighborhood index is a data structure that allows for an efficient execution of the operations for the construction of a graph and for browsing and expansion of paths. It stores all neighbors for the objects in a database Topological Relations Topological relations are those relations which are invariant under linear transformations, i.e., if both objects are rotated, translated or scaled simultaneously the relations are preserved. They present a definition of topological relations derived from the nine intersection model [6, 7, 8]. In the model the topological relations between two objects are defined in terms of the intersections of the interiors, boundaries and exteriors of the objects (see Figure 2.5). The interior of an object consists of points that are in the object but not on its boundary, and the exterior consists of those points that are not in the object (its complement). The Figure 2.6 shows the nine intersection model of two objects A and B: object A s interior (Aº), object A s boundary ( A) and object A s exterior (A ) with object B s interior (Bº), object B s boundary ( B) and object B s exterior (B ). 26

115 Boundary Aº Bº Aº B Aº B A Bº A B A B A Bº A B A B Interior Exterior Figure 2.5. Example of the interior, boundary and Figure 2.6. Nine intersection model. exterior of a circle. In Figure 2.7 we present the topological relations between two objects: Disjoint. The boundaries and interiors of the objects do not intersect. Contains. The interior and boundary of one object is completely contained in the interior of the other object. Inside. The opposite of contains. Equal. The two objects have the same boundary and interior. Touch. The boundaries of the objects intersect but the interiors do not intersect. Covers. The interior of one object is completely contained in the interior of the other object and their boundaries intersect. CoveredBy. The opposite of covers. Overlap. The boundaries and interiors of the two objects intersect. 27

116 Figure 2.7. Topological Relations Distance Relations Distance relations compare the distance between two objects with a given constant using arithmetic operators such as <, >, =. The distance between two objects is defined as the minimum distance between them (i.e., select all elements inside a radio of 50 km from a x point). Figure 2.8 shows two examples of this type of relation; in the figure we are representing how close and how far two objects are each other. Figure 2.8. Distance Relations. 28

117 We propose to use the Euclidian distance which is defined as the straight line distance between two points. In a plane with p1 at (x 1, y 1 ) and p2 at (x 2, y 2 ), the Euclidian distance is 2 ( x x ) + ( y ) y Direction Relations The direction relation R between two spatial objects A and B, (A R B) is defined using one representative point of object A and all the points of the destination object B. It is feasible to define several possibilities of direction relations depending on the number of points that are considered in the source and destination objects. The representative point of a source object may be the center of the object or a point on its boundary. This representative point is used as the origin of a virtual coordinate system and its quadrant defines the direction. The direction relations between two objects are North_of, South_of, East_of, West_of, Northeast_of, Northwest_of, Southeast_of, and Southwest_of. In Figure 2.9 we show some examples of direction relations, for instance, object D is South of object C and East of object A. 29

118 B B North of A C C Northeast of A A A West of D D D East of A D South of C Figure 2.9. Direction Relations. 2.4 Conclusion Data mining is a younger and promissory research field. Many of the data mining approaches developed for knowledge discovery in relational databases were extended to the spatial databases domain. In this chapter we presented several approaches such as generalization, clustering, spatial association, and spatial classifications. We have described three types of spatial relations that will be included in our model to represent as a unique graph-based dataset spatial data, non-spatial data and spatial relations. In the next chapter we will describe the model. First, we will talk about the graph-based knowledge discovery, next we will detail our methodology, and finally we will present an example showing its applicability and generated results. 30

119 Chapter 3 GRAPH-BASED REPRESENTATIONS This chapter describes our graph-based model to represent together spatial data, non-spatial data and spatial relations among the spatial objects. We propose to include three types of spatial relations: topological, distance and direction. Section 3.1 presents some issues related to the graph-based knowledge discovery. Our methodology is detailed in Section 3.2. In Section 3.3 we describe five graph-based representations based on our model. Finally, an example of the applicability of the proposal is presented in Section Generalities In Chapter 2 we presented several approaches developed to search knowledge in spatial databases. The differences between these approaches are based on the data representation and the data mining algorithm used in the search task. Certainly, the data representation used by a mining tool is very important, and it has to be powerful enough to represent domains containing complex relations among their components (i.e., spatial data domain). 31

120 A graph-based representation has these characteristics [3, 5, 17, 36]. It has the benefits of being easy to understand and flexible enough to create different representations of the same domain. The domain (i.e., data and relationships) is described using graphs. These graphs become the input to a graph-based discovery tool which uses a heuristic to choose the subgraphs that are considered important (patterns). A graph is defined as a pair G = (V,E). V = {v 1,,v n } denotes a finite set of elements called vertices. E is a set of edges e satisfying E [V] 2. Then, each edge e E is a pair (v i,v j ). If (v i,v j ) is an ordered pair for any (v i,v j ) E, then G = (V,E) is said to be a directed graph. A labeled graph has labels associated with its edges and vertices. In knowledge discovery systems using a graph-based approach, the data mining algorithm uses graphs to represent knowledge; this means that the data preparation phase includes the transformation of the data to a graph format. The search space of a graph-based data mining algorithm consists of all the subgraphs that can be derived from its input graph. In the literature there exist several definitions about what a spatial data is. However, before presenting our spatial model definition, we precise the following issues: When we speak about a geometric object we refer to an object describing a form (i.e., point, line, and polygon). A spatial object refers to an object represented by a coordinate system (i.e., Cartesian coordinates). 32

121 A geographic object may be considered a specialization of a spatial object because it is represented by a coordinate system but related to earthly coordinates (sometimes called Geodetic or Geographic coordinates). A model is a simplification of the reality. It is not the reality, rather it represents the reality. A model is used to explain or to understand the reality. A model can be, for instance, an equation, a hypothesis or a structured idea. A spatial model is therefore an abstraction of spatial data that generates useful information to help us understand, describe, and predict how things work and/or solve problems in the real world. When we work with geographic coordinates we can talk about a geographic model. 3.2 Methodology The idea is to create a graph-based model to represent together spatial data, non-spatial data and the spatial relations between spatial objects. We will generate datasets composed of graphs with a set of these three elements. We argue that by mining a dataset with these characteristics a data mining algorithm can search patterns involving all these elements, at the same time, improving the results of the spatial analysis process. For example, finding out interesting patterns of objects located at some distance from a particular point; focusing on a current problem such as finding risk zones near the Popocatépetl volcano (México). In addition, it would be important to know the characteristics of the evacuation routes which would be used in situations of volcanic 33

122 activity, i.e., what are the characteristics of the evacuation routes; could they withstand the atmospheric conditions and the passage of vehicles in an emergency situation? A significant characteristic of spatial data is that the attributes of the neighbors of an object may have an influence on the object itself. So, we propose to include in the model the three types of relationship mentioned in Chapter 2: topological, distance, and direction relations. As we mentioned, the basis is that in a spatial database there exist spatial objects, these objects interact with other objects (spatial relations) and they may have several attributes describing them (non-spatial data). Thus, we propose to create graphs that help us to describe these interconnections between all these elements. A simple way can be as follows: for each object we create a vertex representing the object itself and join two vertices with a directed labeled edge if they have a spatial relation in common (we add one edge for each spatial relation among vertices). Additionally, we can also create vertices representing the value of their non-spatial attributes and directed labeled edges joining each attribute to its object (one edge per attribute). But there is a higher complexity level when working with general graphs (a graph with multiple edges between vertices) than with simple graphs (a graph with at most one edge between any given pair of vertices and with no loops). Thus, our idea is to work with simple graphs. So, we propose to improve this representation to the one shown in Figure 3.1 (using UML notation). 34

123 Topological Distance Direction 1..* Spatial relation -From: -To: Edge Vertex Spatial object 1..* Non-spatial attribute Figure 3.1. General graph-based model to represent spatial data. In the model, the spatial data (i.e., spatial objects), non-spatial data (i.e., attributes), and spatial relations are represented as a collection of one or more directed graphs. Therefore, a directed graph contains a collection of vertices and edges representing all these elements. Vertices represent either spatial objects, spatial relation types between two spatial objects (binary relation), or non-spatial attributes describing the spatial objects. Edges represent a link between two vertices of any type. According to the type of vertices that an edge joins, an edge can represent either an attribute name or a spatial relation name. The attribute name can refer to a spatial object or a non-spatial entity. We use directed edges to represent directional information of relations among elements (i.e., object x covers object y) and to describe attributes about objects (i.e., object x has attribute z). 35

124 This knowledge representation has the capability to describe a spatial dataset using graphs, allowing a graph-based mining tool to mine it as a whole. The capabilities of the model to represent the relation between these objects will be of great impact in the results of the data mining processes, since the world is described by objects and the relation between these objects, we can figure out the relations as the elements describing the interaction of the objects with each other. 3.3 Spatial Graph-based Data Representations In the construction of the graph there are issues such as the graph complexity and size that have a direct impact over the data mining algorithm performance. The quality of results refers to another important aspect. Testing several representations will allow us to produce comparisons among obtained results, to evaluate them, and finally to make a decision for selecting the one(s) which offers the better results according to our criteria for success. One approach is to show that the model allows discovering known patterns. Another approach is to have a domain expert saying that the discovered patterns are interesting. We have worked with domain experts for developing these tasks. Currently, we have developed five models to represent spatial data, non-spatial data and spatial relationships among the spatial objects as a unique dataset from the general model. Three issues define the characteristics (i.e., number of vertices and edges, simple graph, etc.) of the graphs created from these models: Equivalent spatial relations. 36

125 Symmetric spatial relations. The way to represent objects and their relations in the model. In the following subsections we will explain how these three characteristics affect the structure and composition of the graph. First, we will talk about the equivalent relations, next the symmetric relations and finally the five created models. Equivalent Spatial Relations Suppose that our dataset is composed of an object A disjoint of an object B, this implies that object B is disjoint of object A. In this case the two objects are disjoint each other (an equivalent relation). When creating the graph, this relation can be represented by two directed edges, one edge labeled as DISJOINT from object A to object B and vice versa. However we can use the following principle: 2 directed edges, 1 edge e(v i,v j ) and 1 edge e(v j,v i ) equal to 1 undirected edge e = ij By applying this principle we can replace the two directed edges by only one undirected edge labeled as the equivalent relation without losing the representation of the spatial relation among the objects. The equivalent relations implemented in our research are the following: TOUCH DISJOINT 37

126 OVERLAP EQUAL CLOSE Symmetric Spatial Relations Suppose that our dataset is composed of an object A South of an object B, when we create the graph this relation is represented by a directed edge from object A to object B labeled as South_of. But it implies the relation object B is North of object A (it is a symmetric relation). This last relation in some models is not represented. According to the model the representation of a symmetric relation implies one of the following options: The addition of a directed edge labeled as the symmetric relation. The addition of a vertex labeled as the spatial relation (i.e., topological, direction) and a directed edge labeled as the symmetric relation. The symmetric relations implemented in our research are the following: CONTAINS INSIDE COVERS COVEREDBY NORTH_OF SOUTH_OF EAST_OF WEST_OF NORTHEAST_OF SOUTHWEST_OF SOUTHEAST_OF NORTHWEST_OF 38

127 Models As we have commented, testing several representations will allow us to produce comparisons among obtained results, to evaluate them, and finally to make a decision for selecting the one(s) which offers the better results (in term of quality). The following five models were developed to answer issues such as: in some models we obtain a reduction in the number of vertices and edges, but do they give us the same representation of our data? Are we gaining in the size reduction, but what are we loosing? There is a higher complexity working with general graphs than with simple graphs, does it affect the generated results? What about time processing? In order to describe the characteristics of each model we will use the sample dataset shown in Figure 3.2. Our dataset is composed of two spatial objects, object A representing a house and object B representing a lake, and the following three spatial relations among them: Distance relation o Object A close object B (equivalent relation). Topological relation o Object A touch object B (equivalent relation). Direction relation o Object A South of object B (symmetric relation). 39

128 North Object B West East Object A South Figure 3.2. Sample dataset. Additionally, to evaluate the characteristics of each model (the model is itself a graph) we have developed the following nine evaluation metrics: Num. vertices. Total number of vertices in the graph. Num edges. Total number of edges in the graph. Size (vertices + edges). Total number of vertices plus total number of edges in the graph. % increment. This item represents the percent of increment (in term of vertices plus edges) of this graph with respect to the graph created by using the base model. Simple graph. To indicate if the graph is a simple one (graph with at most one edge between any given pair of vertices and with no loops). Directed edge. To indicate if the graph has directed edges. Undirected edge. To indicate if the graph has undirected edges. Complete information. The item shows if the symmetric relations, created from the original spatial relations, are also represented in the graph. 40

129 Redundant Relation edge. We introduce this edge as strategy for avoiding creating complex graphs. Remember that we consider simple graph a graph with at most one edge between any given pair of vertices and with no loops. As we will see in the description of those models, we use this special edge to join the vertices representing spatial object with the vertices representing the spatial relation types (i.e., topological, distance and direction). In models #3, #4 and #5 we add a Relation edge for representing explicitly the fact that there is a relation between two spatial objects. This metrics tells us if a redundant Relation edge is presented in the graph. Model #1 - base model Figure 3.3 shows the first model created for representing spatial data as proposed. The characteristics of the model according to the metrics are: Num. vertices: 2 vertices for representing each spatial object (i.e., object A, and object B). Num. edges: 4 edges, 3 edges for representing the original spatial relations (i.e., close, touch, and South_of relations) and 1 edge for representing the North_of relation created from the original South_of symmetric relation. The North_of relation is itself a symmetric relation. Size (vertices + edges): 6 % increment: 0%, it is the base model. Simple graph. No, it is a complex graphs with 4 edges linking 2 vertices. 41

130 Directed edge. Yes, they are used for representing the South_of and North_of symmetric relations. The direction of the edges is according to the lecture of the relations among the objects. Undirected edge. Yes, they are used for representing the close and touch equivalent relations. Complete information. Yes, in the graph is represented the North_of symmetric relation created from the original South_of symmetric relation. Redundant Relation edge. No, in the model we do not use Relation edges. South_of Spatial object A close touch Spatial object B North_of Figure 3.3. Model #1 - base model. Model #2 - single replication of relation types, complete information In Figure 3.4 we show our second model created for representing spatial data. The characteristics of the model according to the metrics are: Num. vertices: 5 vertices, 2 vertices for representing the spatial objects and 3 vertices for representing the topological, distance, and direction spatial relation types. For each spatial relation type among two spatial objects we add 1 vertex labeled as its name. In the example there exist 1 topological relation, 1 distance relation, and 1 direction relation. 42

131 Num. edges: 6 edges, 3 edges for representing the original spatial relations, 2 edges for representing the equivalent relations (i.e., close and touch relations) created from the original ones, and 1 edge for representing the symmetric relation (i.e., North_of relation) created also from the original relations. Size (vertices + edges): 11 % increment: % Simple graph. Yes, there exists at most 1 edge between any given pair of vertices. Directed edge. Yes, they are used for representing all relations. The direction of the edges is from the vertices representing the spatial objects to the vertices representing the spatial relation types. Undirected edge. No, in the model we do not use undirected edges. Complete information. Yes, we represent the symmetric relations created from the original ones. Redundant Relation edge. No, in the model we do not use Relation edges. Distance close close Spatial object A touch Topological touch Spatial object B South_of North_of Direction Figure 3.4. Model #2 - single replication of relation types, complete information. 43

132 Model #3 - double replication of relation types, no complete information In Figure 3.5 we show our third model created for representing spatial data. The characteristics of the model according to the metrics are: Num. vertices: 8 vertices, 2 vertices for representing the spatial objects and 6 vertices for representing the topological, distance, and direction spatial relation types. For each spatial relation type among two spatial objects we add 2 vertices labeled as its name. For instance, in the example there exist three relations: 1 topological relation, 1 distance relation, and 1 direction relation, so we add 6 vertices, 2 per each spatial relation type. Num. edges: 9 edges, 6 edges (the Relation edges) to link the vertices representing the spatial objects to the vertices representing the spatial relation types (from each vertex representing a spatial object start 3 edges since we have 3 relations), and 3 edges for representing the original spatial relations. These last 3 edges are used to join the vertices representing the spatial relation types. Size (vertices + edges): 17 % increment: % Simple graph. Yes, there exists at most 1 edge between any given pair of vertices. Directed edge. Yes, they are used for representing the symmetric relations and the Relation edges. The direction of the Relation edges is from the vertices representing the spatial objects to the vertices representing the spatial relation types. The direction of the other edges is according to the lecture of the relations among the objects. Undirected edge. Yes, they are used for representing the equivalent relations. 44

133 Complete information. No, we do not represent the symmetric relations created from the original ones. Redundant Relation edge. Yes, we use in the model Relation edges for representing explicitly the existence and type of a relation among 2 spatial objects. Additionally, we use this edge for avoiding creating complex graphs. Spatial object A Spatial object B Relation Relation Relation Distance close Distance Relation Relation Relation Topological touch Topological Direction South_of Direction Figure 3.5. Model #3 - double replication of relation types, no complete information. Model #4 - single replication of relation types, no complete information In Figure 3.6 we show our fourth model created for representing spatial data. The characteristics of the model according to the metrics are: Num. vertices: 5 vertices, 2 vertices for representing the spatial objects and 3 vertices for representing the topological, distance, and direction spatial relation types. For each spatial relation type among two spatial objects we add 1 vertex labeled as its name. In the example there exist 1 topological relation, 1 distance relation, and 1 direction relation. 45

134 Num. edges: 6 edges, 3 edges (the Relation edges) to link a vertex representing a spatial object (in the example we have 2 vertices since there are 2 spatial objects) to the vertices representing the spatial relation types (from this vertex representing a spatial object start 3 edges since we have 3 relation types), and 3 edges for representing the original spatial relations. These last 3 edges are used to link the vertices representing the spatial relation types to the un-used vertex representing the other spatial object. Size (vertices + edges): 11 % increment: % Simple graph. Yes, there exists at most 1 edge between any given pair of vertices. Directed edge. Yes, they are used for representing the symmetric relations and Relation edges. The direction of the Relation edges is from a vertex representing a spatial object to the vertices representing the spatial relation types. The direction of the other edges is from the vertices representing the spatial relation types to the un-used vertex representing the other spatial object (it is according to the lecture of the relations among the objects). Undirected edge. Yes, they are used for representing the equivalent relations. Complete information. No, we do not represent the symmetric relations created from the original ones. Redundant Relation edge. Yes, we use in the model Relation edges for representing explicitly the existence and type of a relation among 2 spatial objects. Additionally, we use this edge for avoiding creating complex graphs. 46

135 Distance Relation close Spatial object A Relation Topological touch Spatial object B Relation South_of Direction Figure 3.6. Model #4 - single replication of relation types, no complete information. Model #5 - double replication of relation types, complete information In Figure 3.7 we show our fifth model created for representing spatial data. The characteristics of the model according to the metrics are: Num. vertices: 8 vertices, 2 vertices for representing the spatial objects and 6 vertices for representing the distance, topological, and direction spatial relation types. For each spatial relation type we add 2 vertices labeled as its name. In the example we add 2 vertices for the topological relation, 2 vertices for the distance relation, and 2 vertices for the direction relation. Num. edges: 12 edges, 6 edges (the Relation edges) to link the vertices representing the spatial objects to the vertices representing the spatial relation types, and 6 edges for representing the original spatial relations and those ones generated from them (equivalent and symmetric relations). These last 6 edges are used to link the vertices representing the spatial relation types to the vertices representing each spatial object (3 edges for each spatial object). Size (vertices + edges): 20 47

136 % increment: % Simple graph. Yes, there exists at most 1 edge between any given pair of vertices. Directed edge. Yes, they are used for representing all relations. Undirected edge. No, in the model we do not use undirected edges. Complete information. Yes, we represent the symmetric relations created from the original ones. Redundant Relation edge. Yes, we use in the model Relation edges for representing explicitly the existence and type of a relation among 2 objects. Additionally, we use this edge for avoiding creating complex graphs. Spatial object A touch touch Spatial object B Relation Relation Relation North_of Topological close close South_of Topological Relation Relation Relation Distance Distance Direction Direction Figure 3.7. Model #5 - double replication of relation types, complete information. Table 3.1 presents the results of the nine metrics developed to evaluate the characteristics of each model. Model #1 is named the base model. 48

137 Model Num. Num. Size % Simple Directed Undirected Complete Redundant Vertices Edges (v + e) Increment Graph Edge Edge Information Relation Edge (1) (2) (3) (4) (5) (6) (7) (8) (9) # No Yes Yes Yes No # Yes Yes No Yes No # Yes Yes Yes No Yes # Yes Yes Yes No Yes # Yes Yes No Yes Yes Table 3.1. Characteristics of the graph-based representation models. The metrics were proposed based on the causes/effects each of them has both into the created graph and the mining algorithm. We visualize four significant issues related directly to these metrics: 1. Search space The search space of a graph-based data mining algorithm consists of all the subgraphs that can be derived from its input graph, thus, the number of vertices (1) and edges (2) of the created graph (3) are two important issues defining the search space size 1 for the discovery system. Therefore, the objective must be to minimize the number of vertices and edges used to create the graphs but at the same time to maximize the representativeness of the dataset. As we can see in Table 3.1, the model using the minimum number of vertices and 1 The edge type (directed or undirected) and graph type (simple or complex) are issues that also define the search space of a graph-based data mining algorithm. In Chapter 4 we describe the Subdue system, our graphbased data mining tool, and we show the way Subdue deals with these issues. 49

138 edges to represent our sample dataset is model #1 (2 vertices and 4 edges) whereas model #5 is the opposite case (8 vertices and 12 edges). 2. Processing time The search space size plays a relevant role regarding to the processing time used to discover patterns. If we have a large search space the algorithm would require more time to evaluate all the possible subgraphs. Therefore, a comparison among the percentage of increment (4) metric of the proposed models is presented in Table 3.1. Remember this metric compares the size of a given model with respect to model #1 (base model). For instance, model #5 has a graph size increment of % with respect to model #1. In other words, the mining algorithm will require the evaluation of % more vertices and/or edges by using model #5 instead of model #1 for the same dataset. 3. Graph complexity Next chapter describes the Subdue system, our graph-based data mining tool. As we will see there exists a higher complexity for the mining algorithm to work with complex graphs than with simple ones (i.e., at most an edge among any given pair of vertices and with no loops). For instance, in the graph match process, in the expanding phase (Subdue uses an expanding approach to discover patterns), and in the graph compression stage. Therefore, the objective was to propose graph-based models that allow us to create simple graphs. As we can see in Table 3.1, only model #1 does not create simple graphs. 50

139 Thus, as a strategy to break the multiplicity of edges among two vertices (for instance, vertices representing two spatial objects meeting two or more spatial relations among them) we use the following approaches: To add a new vertex labeled as the relationship type (i.e., topological, distance, and direction) for each spatial relation among the spatial objects. This approach is used in model #2. To add a new vertex (model #4) or two new vertices (model #3 and model #5) labeled as the relationship type (i.e., topological, distance, and direction) by each spatial relation among the spatial objects, and to link this new vertex (model #4) or new vertices (model #3 and model #5) with the vertices representing the spatial objects by using edges labeled as Relation. The approach used to link the vertices is different in each model as we have mentioned in the definition of each of them. This nomenclature is used to represent the fact that there exists a spatial relation among these spatial objects. These edges are known as redundant relation edges (9). 4. Data representativeness The directed edges (6), undirected edges (7), and complete information (8) metrics are used to maximize the representativeness of the dataset and minimizing, as most as possible, the graph size and its complexity. Directed edges are used to represent the symmetric spatial relations (object A North_of object B, implies, B South_of A), the redundant relation edges, and the non-spatial attributes describing the spatial objects. Undirected edges are used to represent equivalent spatial relations (the relation is represented by an undirected edge 51

140 instead of two directed edges). Finally, complete information means that symmetric spatial relations among spatial objects are also represented into the model. 3.4 Use-case To show the applicability of our model, we will use a dataset from the Popocatépetl volcano database [38, 42]. The database contains data related to several issues in the zone such as settlements, rivers, and evacuation roads in the zone, just to mention a few of them. Figure 3.8 shows a fragment of the layers roads and settlements of the Popocatépetl volcano zone. The roads layer (shown in green color) represents the roads in the zone; it is composed of spatial objects (i.e., lines) and non-spatial data describing the characteristics of those roads (i.e., id, start point, end point, length, and type). The settlements layer (shown in pink color) represents population areas in the zone. The layer is composed of spatial objects (i.e., polygons) and non-spatial data describing the characteristics of the settlements (i.e., id, area, perimeter, and type). 52

141 Figure 3.8. Spatial database representing some objects of the world. Suppose we are interested in the identification of relations among the roads starting in, or crossing a settlement (i.e., the type and characteristics of roads in relation to the number of people living in a settlement that in case of a volcanic contingency are required to be evacuated). So, we need to create a dataset involving these elements for mining it and search for patterns that could help us to evaluate the characteristics of the roads and, may be, to make decisions for improving them (or build new ones) for their utilization in case of volcano activity. In this case we are working with different types of spatial objects and also we are adding non-spatial data and relationships (as touch or overlap) to our dataset. The construction of the dataset needs to satisfy some constraints. For example, if we include all the elements existing in the data layers we might build a huge graph, and this will have a direct impact in the data mining algorithm. So, we explore methodologies to deal with topics such as complexity, the size of the graph, noise and quality of the data. A solution for the problem of creating a huge graph is to limit the set of elements to be 53

142 included in it by using selection windows. The user can create these windows and only the elements inside them are candidates to be included as objects in the graph. The idea is shown in Figure 3.9. Suppose the user will work with n data layers, thus, for each layer we will select only the spatial objects inside the area delimited by the selection window. We can see this functionality such as a drill operation over all the spatial layers the user works with. Figure 3.9. Selection window. Once we have defined the working area, as shown in Figure 3.8, we can build the graph. The process consists of three phases. In the first phase, the user has to choose the spatial relation(s) to evaluate among the spatial objects (in the example the touch or overlap topological relations). The second phase involves the validation process, only the objects covered by the relation(s) become elements to be included in the graph. 54

143 Figure Querying a spatial database. The last phase consists of building the graph using the results of the validation process. Figure 3.10 presents an example of this functionality. This time only the area delimited by a selection window is shown. In the figure, the spatial objects satisfying either the touch or overlap relations are shown in blue color, the rest of the objects are shown in their original colors. The circles show some examples where a road is starting or crossing a settlement. 55

144 Figure Graph-based representation for spatial data. 56

145 Figure 3.11 shows the graph created using model #2 (we used the Graphviz System [19] to draw the graph). The vertices in the graph represent either spatial objects (i.e., road, settlement), spatial relations between two objects (i.e., topological), or attributes describing the objects (i.e., ID value). In this example, we use a special vertex labeled as object for expressing the interconnection between the spatial objects, their spatial relations, and nonspatial attributes (i.e., object type road with ID 989 overlaps or touches with object ). According to the type of vertices that an edge joins, the edge can be labeled as either a spatial relation name (i.e., overlap, touch), or as the name of an attribute describing the characteristics of an object (i.e., type, ID). We may read the graph as follows: there are six roads and six settlements inside the working area meeting either a touch or overlap relation in the form road settlement. We suppose that each polygon object in the map represents a settlement. Some roads start from a settlement and others cross a settlement. For example, the road with ID 981 crosses the settlements with ID s 3079, 3139, 3151, 3083 and This means that we need to be careful with this road since it has interaction with several settlements and in case of a contingency it will be widely used. In the graph we included only the object's ID nonspatial attribute for each object. This is a simple example; in the real world the spatial data layers may have hundreds or thousands of objects, and each object may have dozens of attributes describing it. Joining all these elements will allow creating large graphs representing the elements found in a 57

146 spatial database improving the results of the data mining task. Once we have created the graph, it will be used as input to a graph-based mining system. 3.5 Conclusion Our idea is to propose a graph-based model to represent together spatial data, non-spatial data and the spatial relations between spatial objects. Based on the model we generate datasets composed of graphs with a set of these three elements. Our argumentation is that by mining a dataset with these characteristics a data mining algorithm can search patterns involving all these elements at the same time improving the results of the spatial analysis process. We have presented an example of the applicability of the proposal using data from a Popocatépetl volcano database. The created graph will be the input for a graph-based mining system. We propose to use the Subdue system, which will be described in the next chapter. 58

147 Chapter 4 MINING THE GRAPH This chapter describes the characteristics of the Subdue system [24, 45], our data mining tool. Subdue was developed at the University of Texas at Arlington and it can be applied to any domain that can be represented as a graph. In the first part of the chapter we present the general characteristics and main functions. In the second part we describe the overlap feature and its importance for the pattern discovery system, and introduce a new algorithm named Limited Overlap. 4.1 Characteristics The Subdue system discovers substructures using a graph-based representation of structural databases. Structural data involves the concept of units or parts and the interdependence and relationships of those parts. The substructures (a connected subgraph within the graphical representation) describe concepts in the data (i.e., patterns). 59

148 The substructure discovery system represents structural data as a labeled graph. The input to Subdue is a graph where labeled vertices correspond to objects in the data, and directed or undirected labeled edges map relationships between objects. The discovery algorithm follows a computationally constrained beam search. There are three different evaluation methods to guide the search towards more appropriate substructures: Minimum Description Length (MDL), Size-based, and Set Cover. The default evaluation method is MDL. The algorithm begins with the substructure matching a single vertex in the graph. During each iteration, the algorithm selects the best substructure and incrementally expands the instances of the substructure. An instance of a substructure in the input graph is a subgraph that matches (graph theoretically) that substructure. The algorithm searches for the best substructure until all possible substructures have been considered or the total amount of computation exceeds a given limit. Evaluation of each substructure is determined by how well the substructure compresses the description length of the input graph. There might be slight variations of some substructures that can be considered as instances of another substructure. Subdue uses an inexact graph match algorithm to identify this kind of instances. In this inexact match approach, each distortion of a graph is assigned a cost and if the total cost is lower than a given threshold, the substructure is considered an instance of the other substructure. 60

149 A distortion is described in terms of transformations such as the deletion, insertion, and substitution of vertices and edges. The best substructure found by Subdue can be used to compress the input graph, which can then be input to another iteration of Subdue. After several iterations, Subdue builds a hierarchical description of the input data where later substructures are defined in terms of substructures discovered on previous iterations. Figures 4.1, 4.2, 4.3 and 4.4 show an example of the system s functionality. The example is presented in terms of the house domain, where a house is defined as a triangle on a square. T represents a triangle, S a square, C a circle and R a rectangle (see Figure 4.1). The objects in the figure (i.e., T1, S1, R1) become labeled vertices in the graph, and the relationships (i.e., on(s1, R1), shape(c1, circle)) become labeled edges. Input Database Input Graph T1 S1 R1 C1 object shape triangle T2 S2 T3 S3 T4 S4 on object shape square on object shape circle on object shape rectangle on on on object shape triangle object shape triangle object shape triangle on on on object shape square object shape square object shape square Figure 4.1. Graph representation of the house domain. 61

150 The graph representation of the substructure discovered by Subdue from this data is shown in Figure 4.2 where Subdue found four instances of triangle on square. Substructure Instance 1 Instance 2 object shape triangle T1 S1 T2 S2 on Instance 3 Instance 4 object shape square T3 S3 T4 S4 Figure 4.2. Substructure and instances discovered from the house domain by Subdue. After a substructure is discovered, each instance of the substructure in the input graph is replaced by a single vertex representing the entire substructure. In Figure 4.3 the discovered substructure (object shape triangle on object shape square) is labeled as SUB_1 (SUBstructure number 1). object shape triangle on SUB_1 object shape square Figure 4.3. Substructure replacement procedure in the house domain. Finally, the substructure (labeled as SUB_1) is used to compress the original input graph, which can then be input to another iteration of Subdue (see Figure 4.4). 62

151 SUB_1 on on object shape circle object shape rectangle on on on SUB_1 SUB_1 SUB_1 Figure 4.4. Graph representation of the house domain after substructure replacement. The Subdue system s ability to perform discovery has been proved in several domains including scene analysis, chemical analysis and CAD circuit analysis. In [18] the authors present an example of the Subdue capabilities to perform data mining tasks, they work with an earthquake database and show that Subdue is capable of finding not only the shared characteristics of the earthquake events, but also space relations between them. In the case of the identification of shared characteristics, they used the pattern containing the region number specification to recognize the area being studied; in the case of space relations, they found patterns that represent parts of the paths of the involved fault (i.e., subarea with a high concentration of earthquakes) Main Functions In this subsection we present a briefly description of the main functions composing the core of Subdue. Each of them is itself integrated by several subfunctions, but the idea is to present them in a global perspective. 63

152 Compress The compress function returns a new graph, which is the given graph compressed with the given substructure instances. Vertices SUB replace each instance of the substructure, and OVERLAP edges are added between vertices SUB of overlapping instances. Edges connecting to overlapping vertices are duplicated, one per each instance involved in the overlap. Discover This function plays the role of manager in the phase of discovering the best substructures in an input graph. It is in charge of issues such as getting initial substructures, extending each substructure, evaluating each extension, and adding to a final list the best discovered substructures. Evaluate This function implements the different evaluation methods used to guide the search towards more appropriate substructures: Minimum Description Length (MDL), Size-based, and Set Cover. The value of a substructure s in a graph g is computed as: size(g) / (size(s) + size(g s)) The value of size() depends on whether we are using the MDL or Size-based evaluation method. If MDL is used, then size(g) computes the description length in bits of g. In case of Size-based, then size(g) is simply vertices(g) + edges(g). The size(g s) is the size of graph g after compressing it with substructure s. Compression involves replacing each 64

153 instance of s in g with a new single vertex and reconnecting edges external to the instance to point to the new vertex. Extend This function returns a list of substructures representing extensions to the given substructure. Extensions are constructed by adding an edge (or edge and a new vertex) to each positive instance of the given substructure in all possible ways according to the graph. Matching extended instances are collected into new extended substructures, and all such extended substructures are returned. Graphmatch The Graphmatch function returns true if graph_1 and graph_2 match with cost less than the given threshold. If so, the objective is to store the match cost in the variable pointed to by matchcost and to store the mapping between graph_1 and graph_2 in the given array is non null. Subgraph Isomorphism Set of functions implementing a subgraph isomorphism algorithm. The objective is to find predefined substructures: searches for subgraphs of graph_2 that match graph_1 and returns the list of such subgraphs as instances of graph_2. Returns an empty list if no matches exist. This procedure mimics the discover substructures loop by repeatedly expanding instances of subgraphs of graph_1 in graph_2 until matches are found. 65

154 4.2 Overlap Now, we will describe the overlap feature in the Subdue system. Our goal is to present the current implementation of this feature in Subdue and then, in the next section, to describe our new proposal named limited overlap. As we will see, Subdue reports in the results all the overlapping substructures that it can find. Sometimes this is very expensive in time and in the number of substructures that the end user has to analyze. Therefore, the idea is that Subdue can find overlapping substructures only over a set of predefined vertices that are chosen by the user (the limited overlap). The overlap has a preponderant role in the substructure discovery system; it is controlled by the overlap user s parameter. If overlap is false then overlap among instances is not allowed, otherwise, overlap is allowed. To illustrate the cause/effect of the overlap in Subdue, we will present some examples generated by using two standalone functions belonging to Subdue: the Subgraph Isomorphism and Minimum Description Length functions. The generated results by these functions are affected by the value of the overlap parameter. Subgraph Isomorphism (SGISO) As we have already mentioned, the subgraph isomorphism function searches for subgraphs of graph_2 that match graph_1 and returns the list of such subgraphs as instances of graph_2. The subgraph isomorphism function is embodied in the FindInstances function. The procedure is optimized toward graph_1 being a small graph, and graph_2 being a large graph. 66

155 The FindInstances function starts finding a list of single-vertex instances, one for each vertex in graph_2 that matches the first vertex of graph_1. Next, the algorithm attempts to extend each instance in the instance list by an edge (or edge and new vertex) from graph_2 that matches the attributes of the given edge in graph_1. Finally, the instances not matching graph_1 are filtered, and an overlapping instances validation process is performed. If overlap is false overlapping instances are discarded. The resulting list represents the instances of graph_1 in graph_2. For illustrative purpose suppose we have as input graphs those shown in Figure 4.5 and Figure 4.6. Input graph_1 has 3 vertices and 2 edges, and graph_2 has 8 vertices and 7 edges. 1 A a b B C g F A a c E B a b b C B C 5 6 f 7 D Figure 4.5. SGISO - input graph_1. Figure 4.6. SGISO - input graph_2. Running SGISO with the overlap parameter set to false, it finds 1 instance of graph_1 (see Figure 4.7) in graph_2. Figure 4.8 shows the discovered instance in graph_2. 67

156 B a 1 A b C 2 6 Figure 4.7. SGISO - no overlap. Figure 4.8. SGISO - no overlap, 1 instance in graph_2. Now, running SGISO with the overlap parameter set to true, the algorithm finds the 4 instances shown in Figure 4.9. We can see that each instance has the vertices labeled as A, B, and C but they represent different vertices (they have different ID vertices). For example, the first instance has the ID vertices 1, 5 and 3; the second instance has the ID vertices 1, 5 and 6 respectively, and so on. Figure 4.10 shows the 4 discovered instances in graph_2. Instance 1 Instance A A a b a b B C B C Instance 3 Instance 4 1 A 1 A a b a b B C B C Figure 4.9. SGISO - overlap. Figure SGISO - overlap, 4 instances in graph_2. 68

157 Minimum Description Length (MDL) The MDL principle states that the best theory to describe a set of data is that theory which minimizes the description length of the entire dataset. We define the MDL of a graph to be the number of bits necessary to completely describe the graph. The minimal encoding of the graph in terms of bits is computed as follows: MDL = vertexbits + rowbits + edgebits Where vertexbits represents the number of bits needed to encode the vertex labels of the graph, rowbits represents the number of bits needed to encode the rows of the adjacency matrix A (the matrix represents the graph connectivity), and edgebits represents the number of bits needed to encode the edges represented by the entries A[i,j] = 1 of the adjacency matrix A. The standalone MDL function computes the description length (dl) of graph_1, graph_2 and graph_2 compressed with graph_1 along with the final MDL compression measure: dl ( graph_ 2)/ dl( graph_1) + dl( graph_ 2 graph_1) dl(graph_2 graph_1) represents the value of graph_2 compressed with graph_1. The overlap feature has a direct effect to compute this value because the number of instances for compressing graph_2 is based in the instances list returned by the FindInstances function. As we have commented, the Subdue system implements a compress graphs function named CompressGraph. The inputs of the function are the graph to be compressed, and the 69

158 substructure instances used to compress the graph. The function returns a new graph, which is the given graph compressed with the given substructure instances. As part of the graph compressing phase, SUB vertices replace each instance of the substructure, and OVERLAP edges are added between SUB vertices of overlapping instances. Edges connecting to overlapping vertices are duplicated, one per each instance involved in the overlap. The procedure to add OVERLAP edges evaluates two cases: first, if two instances overlap at all, then an undirected OVERLAP edge is added between them. Second, if an external edge points to a vertex shared between multiple instances, then duplicate edges are added to all instances sharing the vertex. The procedure to add duplicate edges is based in the follows pseudocode: if edge connects SUB1 to external vertex then add duplicate edge connecting external vertex to SUB2 else if edge connects SUB1 to another (or same) vertex in SUB1 then add duplicate edge connecting SUB1 to SUB2 if other vertex is unmarked (overlapping and already processed) then add duplicate edge connecting SUB2 to SUB2 if edge connects SUB1 to the same vertex in SUB1 (self edge) then add duplicate edge connecting SUB2 to SUB2 if edge directed then add duplicate edge from SUB2 to SUB1 To show the role plays by overlap in the standalone MDL function (focusing in the generated compressed graph) we will present, in the following figures, examples of the different cases of validation that are implemented to face this feature. Figure 4.11 shows the input graph_1, this graph will be used in all the examples as the first input graph. It has 3 vertices and 2 edges. 70

159 B a 1 A b C 2 3 Figure MDL - input graph_1. In our first example we will use as graph_2 the graph shown in Figure This graph has 8 vertices and 7 edges. As we can see our graph has a remarked edge, the directed edge labeled as g between vertex 1 (labeled as A) and vertex 8 (labeled as F). This edge represents the current case of validation (in the future it will be named the guide edge). As part of the CompressGraph algorithm, there is a function named AddOverlapEdges that adds edges to the compressed graph, describing overlapping instances of the substructure in the given graph, based in two conditions. First, if two instances overlap, then an undirected edge OVERLAP is added between them. Second, if an external edge points to a vertex shared between multiple instances, then duplicate edges are added to all instances sharing the vertex. If the second condition is true, the AddOverlapEdges function calls a new function named AddDuplicateEdges. The purpose of this function is to add duplicate edges based on overlapping vertex between substructures. As we can see, the layout of the edge (that we call the guide edge) is important for the global functionality of the compress graph algorithm. So, in the following examples we will change the guide edge for illustrating the different validation cases. 71

160 Returning to our example, if we run the MDL program with the overlap parameter set to false, our final compressed graph is shown in Figure Since overlap is not allowed between instances, the FindInstances function finds 1 instance of graph_1 that match graph_2. In the graph this instance was replaced by a vertex SUB, so we have only 1 vertex of this type. There are not edges OVERLAP, and no edges were duplicated. B a 8 1 g b F A C b a c E B C f 7 D Figure MDL example 1 - input graph_2. Figure MDL example 1 - no overlap. In the next example our graph_2 is shown in Figure It is the same graph used in the previous example, but this time we run MDL with the overlap parameter set to true. The generated compressed graph is shown in Figure The FindInstances function finds 4 instances since the overlap between instances is allowed. In the graph the 4 instances were replaced by 4 vertices SUB. There are 5 edges OVERLAP, an edge OVERLAP between two vertices SUB tells us that these substructures overlap. By substructures overlap or overlapping substructures we refer that the instances they represent have common vertices. In the example, there is a directed edge (edge g, our guide edge) connecting two vertices, one of them belonging to a substructure (in exact term, to an instance of a substructure) and the other one no belonging to the substructure (vertex F), is an external edge; the 72

161 discovered instances were 4 (overlap is allowed), so, in the graph there are 4 directed edges starting from each vertex SUB (the direction of the edge is preserved) to the vertex F. A similar procedure was implemented for edge f (connecting vertex C and vertex D, vertices number 6 and 7 respectively) and edge c (connecting vertex B and vertex E, vertices number 2 and 4, respectively). They were duplicated since these edges connect vertices belonging to a substructure to vertices no belonging to the substructure (external edges). In the figure we can see that vertex D has 2 edges f connecting it to 2 vertices SUB (preserving the direction of the edge) and vertex E has 2 edges c connecting it to 2 vertices SUB (remember that 4 instances were discovered, overlap is allowed). Finally, our compressed graph has 7 vertices: 4 vertices SUB, 1 vertex F, 1 vertex D, and 1 vertex E; and 13 edges: 5 edges OVERLAP, 4 edges labeled as g, 2 edges labeled as f, and 2 edges labeled as c. a 8 1 g F A b b a E c B C B C 5 6 f 7 D Figure MDL example 2 - input graph_2. Figure MDL example 2 - overlap. For the following examples the overlap parameter is always set to true. 73

162 Figure 4.16 shows our new graph_2, we only changed the direction of the guide edge. Now it starts from vertex F (number 8) to vertex A (number 1). The generated compressed graph is shown in figure In the compressed graph we observe that, this time, the 4 edges g start from vertex F to the 4 vertices SUB (the direction of the edge is preserved). Everything else remains as the previous example. a 8 1 g F A b b a E c B C B C 5 6 f 7 D Figure MDL example 3 - input graph_2. Figure MDL example 3 - overlap. For the next example, we only change from a directed guide edge to an undirected one. Our graph_2 is the graph shown in Figure By running MDL we generate the compressed graph presented in Figure The only modification is that the edges between vertex F and the 4 vertices SUB are undirected edges. Everything else remains unchanged. 74

163 a 8 1 g F A b b a E c B C B C 5 6 f 7 D Figure MDL example 4 - input Figure MDL example 4 - overlap. graph_2. For illustrating a new outcome of the overlap feature in the standalone MDL function, we change our graph_2 as follows: Vertex number 8 labeled as F is deleted. Our guide edge labeled as g is deleted. A new edge labeled as g (our guide edge) is added between vertex number 1 labeled as A and vertex number 3 labeled as C. For the current example we use a directed edge. Our graph_2 is shown in Figure The graph has 7 vertices and 7 edges. The new edge has the characteristic that it is an edge between two vertices belonging to the same substructure and in some cases, as we will see, they belong to overlapping substructures. Remember that overlap is allowed between instances, so the FindInstances function finds 4 instances. 75

164 The generated compressed graph is shown in Figure We mentioned that our guide edge joins 2 vertices belonging to a same substructure (vertex number 1 labeled as A and vertex 3 labeled as C, and in some cases these vertices belong to overlapping substructures. So, in the graph there are 6 edges labeled as g, 4 edges joining vertices SUB (telling us that they overlap), and two self edges in 2 vertices SUB. These last 2 edges are our new case of validation. The self edge tells us that inside the substructure, there is an edge joining two vertices belonging to the same substructure. The direction of the edges does not matter since this is an edge inside the substructure (an internal edge). The other duplicated edges (i.e., edge c and edge f) were computed using the previous described procedure. Finally, our compressed graph has 6 vertices: 4 vertices SUB, 1 vertex D, and 1 vertex E; and 15 edges: 5 edges OVERLAP, 6 edges labeled as g, 2 edges labeled as f, and 2 edges labeled as c. a 1 b A b a g c E B C B C 5 6 f 7 D Figure MDL example 5 - input Figure MDL example 5 - overlap. graph_2. 76

165 For the example shown in Figure 4.22, we only changed the direction of the guide edge. Now it starts from vertex number 3 labeled as C to vertex number 1 labeled as A. The generated compressed graph is presented in Figure The generated compressed graph is the same one that in the previous example. This is telling us that just changing the direction of the guide edge does not have effect in the resulting compressed graph. a 1 b A b a g c E B C B C 5 6 f 7 D Figure MDL example 6 - input graph_2. Figure MDL example 6 - overlap. For the next example, we only change from a directed guide edge to an undirected one. Our graph_2 is the graph shown in Figure The generated compressed graph is presented in Figure The single change is that all edges g are now undirected edges. Everything else remains unchanged. 77

166 a 1 b A b a g c E B C B C 5 6 f 7 D Figure MDL example 7 - input graph_2. Figure MDL example 7 - overlap. Now, for illustrating a new outcome of the overlap feature in the standalone MDL function (self edge in a vertex shared by instances of a substructure), we change our graph_2 as follows: Edge labeled as g between vertex number 1 labeled as A and vertex 3 labeled as C is deleted. A new edge labeled as g (our guide edge) is added between vertex number 1 labeled as A and vertex number 1 labeled as A (a self edge). For the current example a directed edge. Our graph_2 is shown in Figure The graph has 7 vertices and 7 edges. The new edge has the characteristic that is a self edge; it starts and ends in the same vertex. This vertex belongs to overlapping substructures, so we have 4 substructures sharing it. The resulting compressed graph is shown in Figure Since our guide edge is a self edge the compressed graph has 10 edges labeled as g. There are 6 edges joining vertices 78

167 SUB (telling us that they are overlapping substructures), and 4 self edges in 4 vertices SUB telling us that they have an edge which origin and destination vertices are the same. We note in the compressed graph a special characteristic between some vertices SUB. We refer that some pairs of vertices SUB are joined by 2 edges labeled as g. One of them starts in the first vertex SUB and ends in the second one. The second edge has a vice versa direction. This is because they are overlapping substructures with a common vertex and this vertex is the source and destination of an internal edge. Finally, our compressed graph has 6 vertices: 4 vertices SUB, 1 vertex D, and 1 vertex E; and 19 edges: 5 edges OVERLAP, 10 edges labeled as g, 2 edges labeled as f, and 2 edges labeled as c. g a 1 b A b a E c B C B C 5 6 f 7 D Figure MDL example 8 - input Figure MDL example 8 - overlap. graph_2. For the last example, we use as graph_2 the one shown in Figure We only change our guide edge from a directed edge to an undirected one. Figure 4.29 shows the generated compressed graph. There are 2 differences between this compressed graph and the previous 79

168 one. First, the edges labeled as g are undirected edges. Second, the special characteristic for some vertices SUB is implemented as follows: Currently, we note that those vertices SUB joined by 2 edges labeled as g in the previous example, now, they are joined by just 1 undirected edge labeled as g. Since the guide edge is undirected we only need 1 edge for representing them because they are overlapping substructures with a common vertex and this vertex is the source and destination of an internal undirected edge. The 4 vertices SUB have self edges, but they are undirected ones this time. Everything else remains unchanged. Finally, our compressed graph has 6 vertices: 4 vertices SUB, 1 vertex D, and 1 vertex E; and 16 edges: 5 edges OVERLAP, 7 edges labeled as g, 2 edges labeled as f, and 2 edges labeled as c. g a 1 b A b a E c B C B C 5 6 f 7 D Figure MDL example 9 - input graph_2. Figure MDL example 9 - overlap. 80

169 4.3 Limited Overlap As we have described in the previous section, the overlap feature plays a preponderant role in the Subdue s substructures discovery system. As a consequence, the generated results are also conditioned, in a high percentage, to this parameter. However, as we have seen, it is implemented to allow overlap among any instances of a substructure or among all the instances of a substructure. In this context, we propose a new approach named limited overlap. The major feature in this approach is to give the user the means to specify the set of vertices where an overlap is allowed. These vertices may represent significant elements in the context we work with. In the following subsections we will present the motivation, advantages, new algorithm, and an example of the new overlap feature implemented in the Subdue system. Motivation As we have already commented, the current overlap feature in Subdue is implemented in an orthodox way: all or nothing. It means that Subdue allows the overlap among all the instances sharing at least one vertex or that Subdue does not allow (discard) the overlap among instances sharing at least one vertex. But we argue that a third option is needed, an option where the user has the capability to set over on which vertices the overlap will be allowed, it is a limited overlap. We visualize directly three motivations issues to propose the implementation of the new algorithm that will be explained in the following subsections: 81

170 Search space reduction. Processing time reduction. Specialized overlapping pattern oriented search. In order to help us to describe the characteristics of the limited overlap we will use the graphs show in Figure 4.30 and 4.31 as input graphs in the future examples. Input graph PS_1 (Predefined Substructure number 1) has 2 vertices and 1 edge, input graph PS_2 (Predefined Substructure number 2) has also 2 vertices and 1 edge, and finally graph_3 has 9 vertices and 8 edges. 1 A 1 C a 2 B 2 PS_1 f D PS_2 8 1 g B A a c E B a b b C B C 5 6 f f D D Figure Limited overlap - input Figure Limited overlap - input graph_3. graphs PS_1 and PS_2. 1) Search space reduction. In knowledge discovery systems using a graph-based approach, the data mining algorithm uses graphs as a knowledge representation; the search space of a graph-based data mining algorithm consists of all the subgraphs that can be derived from its input graph. The substructures discovery process in Subdue begins with the creation of the substructures matching a single vertex in the graph (one for each of the different labels in the graph). 82

171 Each iteration through the algorithm selects the best substructures and expands the instances of these substructures by one neighboring edge (or an edge and new vertex) in all possible ways. But as part of the process to select the best substructures and then to expand them there exists a filter phase. In this phase according to the overlap parameter the instances of a substructure are evaluated: if overlap is set to true then overlapping instances are kept, otherwise if overlap is set to false then overlapping instances are discarded. For example, suppose we want to search the instances of PS_1 and PS_2 in graph_3. With the overlap set to false Subdue discovers 2 instances, one instance for each predefined substructure as we can see en Figure PS_SUB_X is the nomenclature used by Subdue to identify it is an instance of Predefined Substructure SUB_X; where SUB_X means it is the SUBstructure number X. But, if overlap is set to true, Subdue discovers 4 instances, 2 instances for each predefined substructure (see Figure 4.33). Finally, suppose the user wants to search instances of PS_1 and PS_2 in graph_3 but this time he considers that vertices A have a higher relevance (for example, a remarkable spatial object in a spatial database) so he proposed to use a limited overlap, the overlap will be allowed just among instances containing vertices A. We can see that PS_1 has a vertex A, thus this time Subdue finds 3 instances, 2 instances of PS_1 and 1 instance of PS_2 as shown in Figure In the figure we can observe that instance 1 and instance 2 share the vertex number 1 labeled as A. For PS_2 Subdue finds 1 instance since the other one (instance number 4 in Figure 4.33) has also the vertex number 6 labeled as C, but the overlap is just allowed among vertices A. 83

172 Instance 1 1 A a Instance 2 6 C a Instance 1 Instance 2 Instance 3 Instance 4 1 A 1 A 6 C 6 C a a f f 5 B 9 D 2 B 5 B 7 D 9 D PS_SUB_1 PS_SUB_2 PS_SUB_1 PS_SUB_1 PS_SUB_2 PS_SUB_2 Figure No overlap - SGISO. Figure Overlap - SGISO. Instance 1 Instance 2 1 A 1 A a a Instance 3 6 C f 2 B 5 B 9 D PS_SUB_1 PS_SUB_1 PS_SUB_2 Figure Limited overlap PS_1 - SGISO. We have mentioned that the best discovered substructure by Subdue (by iteration) can be used to compress the input graph, which can then be input to another iteration. After several iterations, Subdue builds a hierarchical description of the input data where later substructures may be defined in terms of substructures discovered on previous iterations. We have also commented that each iteration through the algorithm selects the best substructures and expands the instances of these substructures by one neighboring edge (or an edge and new vertex) in all possible ways. So the number of instances of the substructures defines the search space (by iteration) in the substructure discovery process. As we can see in our example, by using the limited overlap we obtain a search space reduction (with overlap set to true), since the number of instances becoming candidates to be expanded is selected according to the allowed values given by the user. 84

173 2) Processing time reduction. The reduction in the number of instances becoming candidates to be expanded results in a search space reduction, and this effect also has a new outcome, a processing time reduction. Allowing overlap slows Subdue considerably since the number of candidate instances to expand, to evaluate, to match, and to discover increase as we have seen. However, by the implementation of the limited overlap, the number of instances to be processed in these phases decrease resulting also in a processing time reduction in the overall substructure discovery process. 3) Specialized overlapping pattern oriented search. We have also commented that the limited overlap gives the user the capabilities to define the set of interesting elements over which the overlap will be allowed (the elements are represented as vertices in the graph according to the proposed model) so the algorithm will discard the overlapping elements that the user does not considerer significant. In the example presented in Figure 4.34, the judgment of the user is that vertices A have a higher relevance so he proposes to use a limited overlap, the overlap will be allowed just among instances containing vertices A. This consideration is a personal decision of the user according to the work context (in many cases with the support of a domain expert). For instance, a remarkable element may refer to a spatial object in a spatial database or to some characteristic defining a particular topic of a dataset. 85

174 Therefore, the limited overlap gives the user the mechanics to implement a pattern oriented search. The user delimits the set of elements that will have a preponderant role in the substructure discovery process. But this characteristic gives also a new advantage, the patterns evaluation process is simplified since the set of generated results is smaller because it is focused over the user requirements. Algorithm As we have described, the limited overlap gives the user the capabilities to define the set of elements (vertices in the graph) where overlap among instances is allowed. The process starts reading the set of vertices which are integrated to the Subdue s parameters as the limited overlap label list. During the substructure discovery process a filter phase is performed. This phase evaluates (and possible discards) based on the overlap parameter the list of discovered instances. If the substructure discovered process is composed by several iterations, after each of them, the overlap label list is updated to integrate the new vertices where overlap is also allowed. Remember that at the end of each iteration, the best discovered substructure by Subdue is used to compress the input graph, which can then be input to another iteration of Subdue. After several iterations, Subdue builds a hierarchical description of the input data where later substructures are defined in terms of substructures discovered on previous iterations. Therefore, if the best discovered substructure, in each iteration, has a vertex where overlap is allowed then that substructure is added to the limited overlap label list. 86

175 Limited Overlap (instance1, instance2) //Global process while ((elementsinstance1 < totalelementsinstance1) AND NOT endprocess){ //Instance1 s vertex vertex1 = vertex from instance1 //Instance2 s vertex while (elementsinstance2 < totalelementsinstance2) vertex2 = vertex from instance2 //Instances overlap if (vertex1 = vertex2) then{ intancesoverlap = true endprocess = true } //Instances overlap, check by limited overlap if (instancesoverlap AND overlaplabellist NOT EMPTY) then{ if (vertex1 in overlaplabellist) then //It is a limited overlap limitedoverlap = true else //Process continues until all vertex1 are validated endprocess = false } } return instancesoverlap, limitedoverlap Figure Limited overlap in the Subdue system. In the validation process each vertex belonging to instance1 (vertex1) is validated against all vertices belonging to instance2 (vertex2). If vertex1 and vertex2 are the same then they are overlapping instances. If the instances overlap then vertex1 is validated against the list containing the set of vertices allowed for overlapping (the overlaplabellist). If vertex1 exists in overlaplabellist then it is a limited overlap. Example To illustrate the functionality of the new algorithm, we will present some examples using the new version of Subdue implementing the limited overlap feature. The input graphs for 87

176 the examples are those shown in Figure 4.30 and Figure 4.31; the idea is to perform a pattern oriented search by discovering patterns in graph_3 but based on PS_1 and PS_2 (our predefined patterns). Subdue implements a pattern oriented search by, initially, finding all instances of PS_1 and PS_2 in graph_3. The next step is to compress graph_3 using the found instances of each PS_1 and PS_2. Figure 4.36 shows the compressed graph with the overlap parameter set to false. As we can see, Subdue found 1 instance of PS_1 (i.e., vertex PS_SUB_1) and 1 instance of PS_2 (i.e., vertex PS_SUB_2) since overlap among instances is not allowed (see also Figure 4.32). Figure No overlap - compressed graph. Finally, the compressed graph becomes the input graph to discover substructures. Figure 4.37 shows the best 3 substructures (default number of reported substructures) found by Subdue according to its substructure discovery system (see Section 4.1 for details). Each substructure has 1 instance; it means that there exists 1 repetition of each of them in the input graph (in the example we use exact graph match, but Subdue allows also inexact graph match). The first one (labeled as a ), is composed by 2 vertices: 1 vertex B, and 1 88

177 vertex E; and 1 edge labeled as c. Subdue reports this substructure as the best discovered substructure because it is the best that minimize the description of the input graph according to the MDL principle. The second one (labeled as b ) is composed by 4 vertices: 1 vertex PS_SUB_1, 1 vertex C, and 2 vertices B; and 3 edges labeled as a, b. g. The vertex PS_SUB_1 is itself a substructure composed by two vertices: 1 vertex A, and 1 vertex B; and 1 edge labeled as a. We have already commented that later substructures may be defined in terms of previous discovered substructures, or in term of predefined substructures as in this example. Finally, the third one (labeled as c ) is composed by 4 vertices: 1 vertex PS_SUB_1, 1 vertex PS_SUB_2, 1 vertex C, and 1 vertex B; and 3 edges: 2 edges labeled as b, and 1 edge labeled as g. Once again vertices PS_SUB_1 and PS_SUB_2 are themselves substructures. 89

178 a) b) c) Figure No overlap - discovered substructures. For the next example we set overlap to true. The generated compressed graph is shown in Figure Now, Subdue finds 2 instances for each PS_1 and PS_2 (see Figure 4.33 for details). 90

179 Figure Overlap - compressed Graph. Figure 4.39 shows the best 3 substructures discovered by Subdue. The first and second ones (labeled as a and b ) are themselves the predefined substructure PS_SUB_2 and PS_SUB_1 respectively with 2 instances each of them. The third one (labeled as c ) is a substructure composed by 2 vertices and 1 edge with 1 instance. 91

180 a) b) c) Figure Overlap - discovered substructures. Our next example is implemented by using the limited overlap feature. Suppose the user wants to perform a specialized overlapping pattern oriented search: he only wants to allow overlapping instances in vertices representing PS_1. The generated compressed graph is shown in Figure As a result of the restriction for overlapping instances Subdue finds 2 instances of PS_1 (they share the allowed vertex A) and just 1 instance of PS_2 (since they share a not allowed vertex C). The found instances are shown in Figure

181 Figure Limited overlap PS_1 - compressed graph. Figure 4.41 shows the best 3 substructures discovered by Subdue from this graph using the limited overlap. In the figure we can see that Subdue reports as the best substructure PS_SUB_1 with 2 instances (labeled as a ). This is consequence of the integration of the compressed graph since it has 2 vertices PS_SUB_1. The second reported substructure (labeled ad b ) is composed by 2 vertices: 1 vertex PS_SUB_1, and 1 vertex E; and 1 edge labeled as c. The last one (labeled as c ) is composed also by two vertices PS_SUB_1; and 1 edge labeled as PS_OVERLAP_1. 93

182 a) b) c) Figure Limited overlap - discovered substructures. 4.4 Conclusion In this chapter we have described the characteristics and functionality of our graph-based data mining tool, the Subdue system. We introduced a new algorithm named limited overlap. We presented some examples showing the functionally of the new algorithm using an artificial dataset. Examples using real world data will be presented in Chapter 6. These 94

183 examples are developed by using two test domains: a Puebla downtown population census from the year of 1777 and a Popocatépetl volcano database. In the next chapter we introduce a prototype system implementing the proposed model for representing spatial data, non-spatial data and spatial relations among the spatial objects as a whole dataset using a graph-based representation. 95

184 Chapter 5 PROTOTYPE A natural skill of people is the interpretation of visual data; therefore, this is a fact we must consider to get advantage during the data mining process. In [30] the authors say that a future direction for the KDD research field is the design and use of user interfaces: one can create a query language which may be used by non-database specialists in their work. Such a query interface can be supported by a Graphical User Interface (GUI) which can make the process of query creation much easier. The challenge consists of improving the capabilities for displaying the generated results (i.e., from a query or the data mining process) in a graphical mode. The idea is that if we are able to analyze the results in such a graphical way, we may give feedback to the user so that he can refine the analysis process and/or guide the direction for further study. This is the principle in relevance feedback (do an initial query, get feedback from the user, and then incorporate information obtained from prior relevance judgments to redefine the query). 96

185 We developed a prototype system that provides a graphical user interface to perform the data mining phase (and data analysis) using our proposed graph-based representation for spatial and non-spatial data. The analysis is implemented by using spatial and non-spatial queries and the data mining process is performed with the Subdue system, a graph-based data mining tool. In our research we used two test domains to evaluate our proposal. The first one is a database storing data from the Popocatépetl volcano (see Section 3.4 for details). The other one is a database containing data related to a population census from the year of 1777 in Puebla downtown as described in Section Population Census from the year of 1777 in Puebla downtown As our test domain we have worked in the project Habitar y vivir. Análisis del espacio habitacional de la ciudad de Puebla This is a project directed by Dra. Rosalva Loreto López, a researcher of Urban History. Our objective is to make use of data mining techniques to find interesting relations and patterns between the population and habitation spaces in Puebla downtown during that time period. We argue that our model can find patterns involving non-spatial data (i.e., characteristics of people living in the zone), spatial data (i.e., distribution of the space), and relations between them (i.e., characteristics of houses based on people social status and/or number of people living in a house) in a single pattern. 97

186 Figure 5.1 shows the spatial structure we implemented for representing the spatial concepts in the census. The structure has 5 spatial aggregation levels. Parish (spatial concept for representing a physical area) is the upper spatial component. A Parish is defined by one or more Neighborhoods. A Neighborhood can involve several Blocks. A Block is defined by four Streets and finally a Street gives the location to a House, where a family lives. Parish Block Neighborhood Street House Street Neighborhood Block Parish Figure 5.1. Representation of spatial concepts in the census from the year of After defining the spatial concepts involved in our domain, we need a graph-based data representation to describe our data as a graph. We have implemented two graph-based representations for the non-spatial data of the census. These representations have three main components: (1) HOUSE involves the attributes describing a House. It contains one or more Uh (atomic physical area where a family lives). One or more families might inhabit in a House, but each family lives in an Uh. (2) UH involves the attributes describing the living space. (3) MEMBER involves the attributes describing a member of a family. 98

187 Figure 5.2 shows the first representation created for the population census. A description of the representation is as follows: a Parish contains one or more Neighborhoods. A Neighborhood contains one or more Blocks or Locations (spatial concept for identifying with precision a physical area -i.e., north of-). Block and Location contain one or more Streets. A Street contains one or more HOUSES. A HOUSE has two attributes (NCasa -Id assigned to the House- and NHabCasa -number of Uh in a House-). A HOUSE contains one or more UH. An UH has several attributes describing it (Uh -Id assigned to the Uh-, Etnicidad -predominant ethnic group among the members of a family-, TFamilia -type of family, i.e., family with children-, TUh -type of Uh- and NMiembrosFamilia -number of members integrating a family-). An UH contains one or more MEMBERS. Finally a MEMBER has several attributes describing it (JFamilia -head of family-, TitPersona -title of the member, i.e., don, doña-, Nombre -first name-, Apellido -last name-, Sexo -sex-, EdoCivil -marital status-, Parentesco -social relationship with respect to the head of family-, GpoEtnico -ethnic group- and Edad -age-). 99

188 Parish Block value value contain Neighbor hood contain contain Location contain contain Street contain NHabCasa NCasa HOUSE contain value UH Uh Etnicidad TFamilia TUh NMiembrosFamilia contain value value value value MEMBER JFamilia TitPersona Nombre Apellido Sexo EdoCivil Parentesco GpoEtnico Edad value value value value value value value value value Figure 5.2. First representation for the non-spatial data in the population census from the year of The second representation is presented in Figure 5.3. The representation can be read as follows: A HOUSE is the main item. A HOUSE has several attributes describing it (Parish, Neighborhood, Block, Location, Street, NCasa and NHabCasa). A HOUSE contains one or more UHS. An UH has several attributes describing it (Uh, TUh, Etnicidad, TFamilia, NMiembrosFamilia). In a UH lives (contains) one or more MEMBERS, and finally, a MEMBER has several attributes describing it (Jfamilia, TitPersona, Nombre, Apellido, Sexo, EdoCivil, Parentesco, GpoEtnico and Edad). 100

189 value Parish Uh value JFamilia value value Neighborhood TUh value TitPersona value value Block HOUSE contains UH Etnicidad value Nombre value value Location TFamilia value Apellido value value Street NMiembros Familia value Sexo value value NCasa contains MEMBER EdoCivil value value NHabCasa Parentesco value GpoEtnico value Edad value Figure 5.3. Second representation for the non-spatial data in the population census from the year of Modules This section presents a prototype system developed for testing our graph-based data representation model for spatial data mining as proposed. The prototype is implemented in Java. We use Oracle DBMS version 9i for storing and processing our data. Oracle has a module to manage spatial data named Oracle Spatial; we take advantage of the spatial operators, geometry functions and spatial aggregate functions implemented in the module for either identifying the objects in a region over a spatial layer or obtaining/validating the spatial relations between two spatial objects. The prototype is divided into seven modules: 101

190 Data Preparation and Cleaning. The data preparation and cleaning phase is a step of the Knowledge Discovery in Databases (KDD) process. Our module helps the user to validate the data and creates the necessary structures to store it in the database. In this process, the data mining expert and the user must work together to identify the activities to be performed that are related to the process (i.e., to identify noise and remove it, treat missing values, etc). Once these activities have been defined, the process is transparent to the user when adding more data (belonging to the same database schema for the domain). Query. We have implemented the interface shown in Figure 5.4 to allow the user to create non-spatial queries. The goal is to allow the user, to make use of spatial and non-spatial attributes, to query the database and to build graphs from the obtained results. The Query panel allows the user to query the database using an SQL-like approach. The queries are created in real time and then submitted to the database for answering the user request. The results are presented to the user in two ways. The first uses a relational approach and the second is represented over a map. 102

191 Figure 5.4. The query panel. The interface is divided in 2 sections, the Control and the Results sections. The Control section is divided in 4 subsections: Select, Where, OrderBy and GroupBy. The Results section is divided in 2 subsections: Query visualization and Generated results. In the Control section the user creates the query. The query is created by specifying the 5 most usual clauses in a SQL statement: Select From-Where-OrderBy-GroupBy. The first two elements are mandatory, the last three are optional. 103

192 The Select subsection presents the twenty-two fields that compose the core of the database. Twenty one of these fields have four options implementing query functionalities as follows: option1 attribute name option2 options3 option4. The first option is used to indicate if the field will be included in the query. The second option contains the keywords used to implement, currently, five grouping functions: count", "count(distinct())", "distinct", "max" and "min". These keywords are used to group the data by the field(s) selected in the GroupBy subsection. Options three and four are used to indicate if the descriptive attributes associated to the field will be included in the query. If the first option is not selected then all the other options are ignored. Additionally, the Select subsection has an item containing the name of the schema (i.e., 1777 census) to create the from clause of the query. The Where subsection is used to indicate the set of conditions, by field, for restricting the rows to be selected (the dataset returned by the query). A condition specifies a combination of one or more expressions composed of attribute names, attribute values, and logical operators. These conditions are entered manually by the user, so knowledge about the domain is required. The OrderBy subsection allows the user to indicate if he wants to sort the rows returned by the query and which field(s) will be used to perform the operation. We can sort the rows returned using the twenty-two fields composing the database and we can also implement combinations of these elements (i.e., sorting by the Name and Sex fields). The GroupBy subsection was implemented to allow the user to group the selected rows based on the value(s) of each row and return a single row with a summary of the 104

193 information for each group. We can group the rows using twenty-one of the twenty-two fields of the database and we can also implement combinations of these elements (i.e., grouping by the Parish and Sex fields). The grouping aggregation function (i.e., count or sum) is chosen in the Select subsection. As we have already mentioned, we have implemented five grouping functions. Our prototype system creates queries by selecting the different fields and options contained in the four previous sections. This query is created in real time, displayed in the Query visualization area and then submitted to the DBMS for processing. The obtained result is displayed in the Generated results area; it is presented by using a relational approach (rows and columns). SQL. Using this interface the user can create a query by typing it directly (in manual form). The principle is the same as we described in the Query panel, we want to create queries and to use the results of the queries to create graphs (based in our model). These graphs will be the data source for our data mining algorithm so that we can find patterns that allow us to understand and describe our data. In the Query panel (see Figure 5.4) the user creates the query in real time by using a graphical interface but if he wants to modify it, it is not possible (the query is displayed just for visualization purpose). Each time the user creates a query in the Query panel, the created query is copied to the Query visualization area in the SQL panel so if the user wants to enhance or modify the query he is able to do it. The generated results are presented to the 105

194 user in the Generated results area. Both interfaces (Query and SQL panels) provide the user with the capability to send the generated results to a text file or a printer. Map. Another way to create graphs in the prototype is using the Map panel. In this case the interface displays in a graphical way the spatial layers stored in the database (see Figure 5.5). Figure 5.5. The map panel. By using the interface, the user can delimit the set of spatial and non-spatial data that will be used for creating the graphs. Some times the user wants to analyze only some regions in a spatial layer so it is not necessary to include all the data in the graph. Additionally, we 106

195 have to take into account that if we have a huge database we will build huge graphs and this feature has a direct impact over the data mining algorithm, so this is an important issue we have to face. A solution for facing the problem of creating huge graphs is delimiting the set of elements to be included in it by using selection windows as we mentioned in Chapter 3. A second method for delimiting the set of elements to be considered while creating a graph is by using the results of the non-spatial queries created and processed in the Query and SQL panels. This is implemented by a process that identifies the spatial objects involved in the results generated by a non-spatial query (i.e., a query computing the number of people, grouped by Sex, living in each Parish in the census from the year of 1777 so we can select and show the Blocks or the Streets belonging to each Parish over the map). The interface is divided in four sections: Visualization area, Layers in database, Operations control and Map control. The Layers in database section displays the name of the spatial layers stored in the database. The component is also used for selecting and identifying the spatial layers to work with. We also include information about the spatial relations to be considered in the graph. The Operations control section includes the operations (Topological, Distance and Direction) implemented to validate the spatial relations among spatial objects. In the case of the topological relations we have implemented the validations supported by the Oracle Spatial module. 107

196 The Map control section has five buttons implementing the Zoom in, Zoom out, Show all, Reset all and Save map operations. The Visualization area works in two operational modes: Query mode for creating selection windows and Zoom mode for implementing the Zoom in and Zoom out operations. The option X Layer is used to indicate if the validation of spatial relations among spatial objects will just be among objects belonging to different spatial layers (i.e., objects belonging to the Parish and Neighborhood layers) or among all objects belonging to all layers. The last two options are used to control the characteristics of the Selection window: a Selection window tool can be a Rectangle or a Circle, and we can specify if we want to select just the elements inside the Selection window or the elements inside and touching its border. Once the user has selected the working area, the following steps identify the objects inside the area and validate the spatial relation(s) among them. We have implemented the validation of topological, distance and direction relations; only the objects meeting the relation(s) chosen by the user will be candidates to become objects in the graph. Spatial Graph. The next task is to create the graph. First, the user must select the nonspatial attribute(s) describing the spatial objects in the graph (see Figure 5.6). Remember that the spatial objects and spatial relations among the objects that will be included in the graph were selected in the Map panel. 108

197 Figure 5.6. The spatial graph panel. By using this interface, the user can select the non-spatial attributes, for each spatial layer he works with, that will be related to the spatial objects in the graph. For instance, suppose the user is working with the spatial layers A and B; the spatial layer A has five non-spatial attributes and the spatial layer B has three non-spatial attributes. So, he may select from layer A two attributes and from spatial layer B three attributes. The idea is to give the user the capability to select the non-spatial attributes, for spatial layer, that he considers relevant for the mining task. 109

198 From the Graph characteristics section the user defines the format, model and output device characteristics for the graph. The graph generated by the system can be created following the Subdue or the Graphviz [19] layouts. In the first case the graph is created for feeding the Subdue system, and in the second case it is created for visualization purposes. Currently, we have implemented five graph-based representation models in our prototype. Each model expresses a representation proposal for creating graphs involving the three basic elements found in a spatial database (spatial, non-spatial data, and spatial relations among spatial objects). The resulting graph can be visualized on the screen or stored in a text file. The last option was implemented in order to provide the Subdue and Graphviz systems with their corresponding input files. We can see at the right side of Figure 5.6 an example of a created graph using the Subdue layout. This sample graph was generated from 2 spatial layers (i.e., chiglesia and chpuebla). From each of them the user selected one or more attributes that were related to the spatial objects. Figure 5.7 presents a fragment of the same graph but this time it is drawn using the Graphviz system. 110

199 Figure 5.7. Graph representation of processed data. Non-Spatial Graph. Focusing in the Habitar y vivir. Análisis del espacio habitacional de la ciudad de Puebla project we have implemented an interface for allowing the user the creation of graphs based on the two graph-based representations created for the non-spatial data of population census (they are shown in Figure 5.2 and Figure 5.3). These graphs are created using only non-spatial data. Our objective for implementing graphs with this characteristic is to have metrics that allow us to compare and to evaluate the results generated by the data mining algorithm when we use graphs containing spatial data, nonspatial data, and spatial relations at the same time against graphs containing only nonspatial data. The interface implemented is shown in Figure 5.8. The user selects the non-spatial attributes and defines the settings that will be used for creating a query (as in the spatial graph). The result obtained from the query is used for creating the non-spatial graph. 111

200 Figure 5.8. The non-spatial graph panel. Subdue. The Subdue panel contains the interface developed for calling the Subdue system (see Figure 5.9). In order to run Subdue, the user must select the corresponding text file (a file containing a graph) and define the parameters that will guide the Subdue s substructure discovery system for finding substructures. 112

201 Figure 5.9. The Subdue panel. Figure 5.10 shows an example of a Subdue s standard output (only for one substructure) for displaying the discovered substructures (i.e., patterns) from an input graph. Since our definition of instances and substructures as graphs (see Chapter 4.1 for details), the Subdue s output is also a graph, in our example, it can be read as follows: Substructure value = Represents the value of the substructure in the input graph. Pos instances = This value tells us how many instances of the substructure exist in the input graph. 113

202 Graph (2v, 1e). Number of vertices ( v ) and edges (in our example directed edges d ) in the substructure. v 1 SUB_4. Substructure s first vertex labeled as SUB_4. v 2 X. Substructure s second vertex labeled as X. d 1 2 ETNICIDAD. A directed edge from vertex 1 to vertex 2 labeled as ETNICIDAD in the substructure. Substructure: value = , pos instances = 3865, neg instances = 0 Graph(2v,1e): v 1 SUB_4 v 2 X d 1 2 ETNICIDAD Figure Example of Subdue s standard output. As we have already commented, Subdue is a system that finds substructures in a hierarchical way, that is, a substructure found in a previous iteration can appear in a new iteration. When this happens, those substructures are represented by a vertex labeled as SUB_x, which represents the best substructure discovered at iteration x. In our example (Figure 5.10), SUB_4 is itself a substructure defined by 2 vertices and 1 edge where its first vertex is labeled as SUB_2. This means that the definition of SUB_4 is composed by the definition of SUB_2. Substructure SUB_2 is defined by 2 vertices and 1 edge where its first vertex is labeled as SUB_1. Again, this means that the definition of SUB_2 is composed by the definition of the previously discovered substructure SUB_1. Finally, SUB_1 is a substructure defined by 2 vertices and 1 edge. 114

203 As we can see, the lecture and interpretation of the substructures discovered by Subdue may be a complicated task. We have implemented a parsing function for reading a text file containing the results generated by Subdue, so we can create the necessary data structures for presenting to the user the same results but using an easier way to read it as we show in Figure 5.11 (the figure presents the example described in Figure 5.10). This layout is created by using the Graphviz system and is saved as a JPEG image file but we can save it in any output format supported by Graphviz. The goal is to improve the way the user can read and interpret the results generated by Subdue. Figure Layout for reading the Subdue s discovered substructures. 5.3 Conclusion As test domain we have designed and built a spatial database to store both a population census from the year of 1777 in Puebla downtown and a map representing the blocks in the 115

204 zone. This data is part of a project directed by Dra. Rosalva Loreto López, a researcher in the urban history domain. We have developed a prototype system implementing our model to represent together spatial data, non-spatial data and the spatial relations among the spatial objects. The prototype allows the user to select the spatial layers to work with, to create spatial and nonspatial queries that will be used to select the spatial objects that will be included in the graph. For each spatial layer the user work with, he has the capability to select the attributes that will be related to the spatial objects in the graph. We have also implemented a visualization tool which helps us to display in a graphical way (by using the Graphviz system) the hierarchical discovered substructures by Subdue. In the next chapter we present three cases showing the applicability of our methodology for modeling and mining spatial data mining using a graph-based representation. 116

205 Chapter 6 RESULTS This chapter presents three use-cases of our methodology for modeling and mining spatial data using the proposed graph-based representations. For this purpose, we used two spatial databases as our test domains. The first database contains data related to a population census from the year of 1777 in Puebla downtown and the second one is a database storing data from the Popocatépetl volcano. The use-cases described were implemented based on the following premises: evaluating the graph-based proposal for modeling and mining spatial data, evaluating the limited overlap feature, and evaluating the discovered knowledge with the support of a domain expert. Therefore, the three use-cases presented in this chapter were implemented based on the following methodology: 1. Selection of the spatial layers to work with. 2. Selection of the spatial relations that will be validated among the spatial objects. 3. Selection of the non-spatial attributes that will be related to the spatial objects in the graph. 117

206 4. Mining the graph using the no overlap, standard overlap and limited overlap features. 5. Evaluation of discovered patterns. 6.1 Population census from the year of 1777 in Puebla downtown As we have already mentioned, our first test domain is a spatial database containing data of a population census from the year of 1777 (see Chapter 5 for more details). Figure 6.1 shows a fragment of the chpuebla, chiglesia and chrio spatial layers used in the usecases. Figure 6.1. Population census from the year of 1777 in Puebla downtown. 118

207 The chpuebla spatial layer (shown in white color) represents blocks in Puebla downtown; this layer is related to a population census from the year of 1777 as non-spatial data. The chiglesia layer (shown in red color) contains representative churches for each parish in the zone. The chrio layer (shown in green color) represents a river crossing Puebla downtown. It is important to remark that a parish is a spatial object grouping several blocks in the zone. Each parish has a church as its agglomerative element (people used to live around a church). Figure 6.2 shows the six parishes (each shown in a different color) and their representative church: 1. El Sagrario. 2. San José. 3. San Marcos. 4. San Sebastián. 5. Santa Cruz. 6. Santo Angel. 119

208 Figure 6.2. Parishes in Puebla downtown in the 1777 year. All use-cases presented in this section were developed using all graph-based models (currently five models). However, the description of the generated results in this first test domain corresponds only to model #2 (the model is named single replication of relation types, complete information). Some of the discovered patterns were used by the domain expert to validate facts already known (i.e., distribution of the population in the census) and other allows him to know unknown relationships among spatial objects and non-spatial data (attributes) in the census (i.e., common characteristics of people living along the two borders of the river crossing Puebla downtown). In Section 6.2 we present a comparison among the generated results by each proposed model using a Popocatépetl volcano database. 120

209 6.1.1 Use-case: El Sagrario Suppose we want to know what people have in common in the spaces within a radius of 150 meters from the representative church in parish #1 (El Sagrario). Our experiment will be focused to find regularities related to the following issues: Number of habitation spaces in a house. Members of a family. Type of family. Ethnic group of each family member. The guideline to select a radius of 150 meters from the church is that this value allows us to include in our sample dataset at least one block in all directions around the church as we show in Figure 6.3. Thus, by using the prototype system we selected the chpuebla and chiglesia spatial layers. Once we selected the spatial layers to work with, the following steps are to select the spatial relation(s) to be validated among the spatial objects and the non-spatial attributes that will be related to the spatial objects in the graph. The selected parameters were the following: Spatial layers: chpuebla and chiglesia. Pivot: representative church in parish #1. Spatial relation: distance. o Value: 150 meters within a radius from the representative church to the blocks. Spatial graph-based model: model #2. 121

210 Figure 6.3. Blocks 150m. from representative church, parish El Sagrario. The generated graph was composed of 24,167 vertices, and 24,166 edges according to the proposed graph-based model. Figure 6.4 shows a fragment of the graph. Figure 6.4 Example of a generated graph in the use-case El Sagrario. 122

211 This graph was used as the input to the Subdue system. For the experiment we used the following Subdue s parameters: Predefined substructure: yes (we used UH CONTAIN MEMBER since these elements are grouping components in our graph-based representation for the nonspatial data of the census). See Section 5.1 for more details. Overlap: yes. Limited overlap: no/yes (first, we used standard overlap; next, we used limited overlap). Once Subdue completed the mining process, the generated results (patterns) were evaluated by the domain expert. Figure 6.5 shows three examples of discovered patterns using the standard overlap option. The expert s opinions were based on the following issues: the patterns are based on the population distribution schema in the census from the year of 1777 (large population inhabits in parish #1). 65% of the population, in this area, did not given its ethnic group, this can be interpreted from a demography history perspective, as a possible dissolution of the racial element for grouping people (creation of groups or classes) in benefit of alternative grouping parameters such as salary, family networks (how they lived and whom they lived with), consumption levels, type of house. 16% said to be Spanish. People lived based on the model Jefe con Familiares y Agregados. 123

212 a) b) c) Figure 6.5. Examples of discovered patterns by using standard overlap in use-case El Sagrario (1). We can see that the patterns shown in the figure are composed of spatial data (i.e., a vertex representing a house) and non-spatial data (i.e., a vertex and an edge telling us the major ethnic groups in the census). Although those patterns do not contain any reference to spatial relations we must to note that a spatial relation was used to define the dataset for the mining phase (i.e., 150 meters within a radius from the representative church to the blocks). They do not appear in the discovered patterns but they are part of the input graph. An explication to this situation is that most common characteristics (repetitions in the input graph), in this example, occur in the non-spatial data describing the spatial objects (i.e., the ethnic group X- Undefined and E- Spanish ). However, these facts represent the strength of our proposal since our basic idea is to represent spatial data, non-spatial data and spatial 124

213 relations among the spatial objects as a unique dataset, mine it together, so we can discover patterns involving these elements at the same time. The next step in our experiment was to evaluate our limited overlap proposal in Subdue. We stated three motivations to propose the new algorithm: a search space reduction, time processing reduction and specialized overlapping pattern oriented search. Therefore, in order to demonstrate that by using the new approach we obtain those benefits we implemented the following example. For the limited overlap test, we allow overlap only in vertices representing the ethnic group of the family members. These vertices are used to guide the overlapping pattern oriented search. In this example, the discovered patterns by using the standard and limited overlap features were slightly different. Figure 6.6 presents three examples of discovered patterns using the limited overlap. For example, both cases reported as the first substructure a pattern telling us Undefined is the predominant ethnic group in the dataset. The second discovered substructure, for both cases, tells us that Spanish is the next predominant ethnic group. In the case of the third substructure, using standard overlap Subdue reported a pattern related to the number of habitation spaces in a house, but through limited overlap Subdue found a relationship among the Undefined ethnic group and the family type Jefe con Familiares y Agregados. An interpretation of these results is that limited overlap reports in all discovered substructures a vertex representing ethnic group because we oriented the search over this type of vertices when we specified that vertices representing ethnic group were allowed for overlapping. 125

214 a) b) c) Figure 6.6. Examples of discovered patterns by using limited overlap in use-case El Sagrario (1). The next step in our experiment was to evaluate the objective of a processing time reduction by using the limited overlap feature over the standard overlap. Figure 6.7 presents the processing time comparison chart for this experiment. In the figure, the time taken by using standard overlap is shown in pink color (193,426 seconds). On the other hand, the processing time taken by using the limited overlap option is shown in yellow color (36,042 seconds). As we can see, we obtained a time reduction gain of 81.37% using our proposed limited overlap approach. 126

215 36, ,426 Limited Overlap Overlap Seconds Figure 6.7. Processing time standard vs. limited overlap: use-case El Sagrario (1). Now, we are going to modify slightly our study zone to show the way we can work with two or more spatial relations at the same time. By using the prototype system the user can select one or more spatial relations that will be validated among the spatial objects. For instance, it is possible to select two spatial relations belonging to the same spatial relation type (i.e., topological) or to different spatial relations types (i.e., one topological and one distance relation). Suppose we want to search for regularities about people and habitation spaces in a radius of 150 meters from the same representative church in parish #1, but this time, we only want to evaluate blocks located on the North side as shown in Figure 6.8. Our experiment will be focused to know common regularities over the same issues as in the previous test. By using the prototype system we selected the following parameters: Spatial layers: chpuebla and chiglesia. Pivot: representative church in parish #1. Spatial relation: distance. 127

216 o Value: 150 meters. Spatial relation: direction o Value: North. Spatial graph-based model: model #2. The generated graph was composed of 12,021 vertices, and 12,026 edges according to the proposed graph-based model. Figure 6.8. Blocks 150m. North side from representative church, parish El Sagrario. As in the previous example, the graph was used as input to Subdue. For the experiment we selected the following Subdue s parameters: 128

217 Predefined substructure: yes (we used UH CONTAIN MEMBER since these elements are grouping components in our graph-based representation for the nonspatial data of the census). See Section 5.1 for more details. Overlap: yes. Limited overlap: no/yes (first, we used standard overlap; next, we used limited overlap). Once Subdue completed the mining phase, the generated results were evaluated by the domain expert. The expert found the following facts: the predominant ethnic group in the area remains as Undefined because of the proximity to the parish center and the continuous racial and social interchange from the North side. As consequence, they share the same family type structure. The Spanish population is the most important (15%) and the next one is Mestizos (7.13%). The family nucleus which includes other people living with (added people) employs Mestizos as subordinated workers (i.e., waiters and salesmen). It is important to remark that they do not employ Indígenas (maybe because they do not speak Spanish and their limited cultural level). This is the domain expert evaluation but we also needed to evaluate how the new limited overlap feature worked. Therefore, the same experiment was performed using the standard and then the limited overlap. For limited overlap, the vertex allowed for overlap was the same as in the previous test. The discovered patterns by using standard and limited overlap were very similar to those obtained in the first test. In fact, the first and second reported substructures were the same although the third one was different (see Figure 6.9). By using 129

218 standard overlap Subdue reported Mestizos as the third predominant ethnic group. Limited overlap reported a relationship among the Undefined ethnic group and the family type Jefe con Familiares y Agregados. This last pattern was the same one reported in the previous test. a) b) Figure 6.9. Examples of discovered patterns in the use-case El Sagrario (2) Figure 6.10 shows the processing time comparison chart by using the standard and limited overlap features. In this figure, the time taken when using the standard overlap feature is shown in pink color (30,532 seconds) while the processing time taken when using limited overlap is shown in yellow color (2,471). It is important to note that by using our new overlap approach we obtained a time reduction gain of 92% in our experiment. 130

219 2, ,532 Limited Overlap Overlap Seconds Figure Processing time standard vs. limited overlap: use-case El Sagrario (2) Use-case: People living along the borders of the river crossing Puebla downtown As we mentioned at the beginning of the chapter, a river crosses Puebla downtown. Suppose that we are interested in knowing characteristics about family types and ethnic groups of people living along the borders of the river. Therefore, we defined as our study area all blocks located at most 50 meters from the borders of the river as shown in Figure We selected this distance since it allows us to select at least one block along the entire border of the river. We used the following parameters in our use-case: Spatial layer: chpuebla and chrio. Pivot: the river. Spatial relation: distance. o Value: 50 meters. 131

220 Spatial graph-based model: model #2. The generated graph was composed of 16,597 vertices, and 16,596 edges according to the proposed graph-based model. Figure Blocks 50m. from river crossing Puebla downtown. The created graph was used to feed the Subdue system. For this test we selected the following Subdue s parameters: Predefined substructure: yes (we used UH CONTAIN MEMBER since these elements are grouping components in our graph-based representation for the nonspatial data of the census). See Section 5.1 for more details. Overlap: yes. 132

221 Limited overlap: no/yes (first, we used standard overlap; next, we used limited overlap). Figure 6.12 shows two examples of discovered substructures in this use-case. Our domain expert evaluated the generated results focusing in the following issues: there is a modification for the population agglomerative criterion on the East side ( San Francisco, El Alto Xonaca, and Los Remedios neighborhoods) and on the West side ( San José, El Sagrario and El Carmen neighborhoods) of the river. The Mestizos is the predominant ethnic group (24.5%); the Spanish is the next one (almost has the same percentage). This pattern may outline that the Mestizos ethnic group played the role of intermediator between the Spanish and Undefined groups on the West side, and between the Spanish and Indígenas on the East side. a) b) Figure Examples of discovered patterns in use-case people around the river crossing Puebla downtown The previous results were generated using standard overlap, but we needed to compare them with the results obtained using limited overlap, so we used the same parameters as in 133

222 the previous test. We proposed to allow overlap only in vertices representing the ethnic group. It is important to mention that, if our overlap label list has many elements, it could be more efficient (concerning to processing time) to use standard overlap because of the overhead that results from the validation process. Remember that when we use the limited overlap feature, each time that an overlap among instances of a substructure is detected, the overlap is evaluated in order to know if it is allowed or not. In this experiment we noted that the patterns discovered using the standard and limited overlap approaches were the same but the processing time taken by each of them for the mining task was slightly different. Figure 6.13 presents the time comparison chart for the experiment showing the time taken when using standard overlap in pink color (15,572 seconds) and the time taken when using limited overlap in yellow color (14,021 seconds). Limited overlap required a lower time to find the same patterns than standard overlap. 14, ,572 Limited Overlap Overlap Seconds Figure Processing time standard vs. limited overlap: use-case people around the river crossing Puebla downtown. 134

223 The two use-cases presented in this section show the functionality of our proposal for modeling and mining spatial data using a graph-based representation. The discovered patterns in the population census from the year of 1777 were analyzed by a domain expert. Some of these patterns allow the user to validate facts already known. For instance, the predominant ethnic groups classified by parish, and population distribution according to the social status in the zone. But other patterns allow him to know implications, previously unknown, among spatial concepts and non-spatial attributes in the census. For instance, racial and economic interchange among people in the parishes, which are the common characteristics of population living along the borders of the river crossing downtown (on the West and East sides), social structure according to the ethnic group, common regularities among the family type and habitation space. Processing time comparison among standard and limited overlap features was other topic evaluated in these use-cases. We presented time processing comparison charts showing the time reduction gain obtained by using the new approach. We also demonstrated that by using limited overlap we can orient the search over substructures (patterns) containing elements that in our domain may represent relevant issues. For instance, in the year of 1777 the ethnic group represented a significant element to know characteristics about a family and their habitation space. Next section presents a use-case using a Popocatépetl volcano database. The objective in this illustrative use-case is to evaluate/compare the generated results by each proposed graph-based model. 135

224 6.2 Popocatépetl volcano In Section 4.3 we presented a preliminary use-case showing the applicability of our methodology using a Popocatépetl volcano database. This section presents an extended usecase employing the same database. First, we describe the study area, next the method used to define the dataset and the parameters for the spatial data mining processes, and finally the generated results. As already mentioned, the database contains data related to several issues in the Popocatépetl zone such as settlements, rivers, and evacuation roads. Figure 6.14 shows a fragment of the Popocatépetl volcano database. For the experiments we will use three spatial data layers. The first one is the roads layer (representing the roads in the area), the second one is the rivers layer (representing the rivers in the area), and finally the third one is the settlements layer (representing population areas). To illustrate the use-case we have delimited a study zone as shown in the figure. 136

225 Figure Popocatépetl volcano. The experiments will focus on identifying relationships and characteristics shared among settlements, roads and rivers in the study zone. Suppose we want to know characteristics shared by these elements that can help us to implement/evaluate evacuation plans in case of a volcanic contingence. For example, characteristics of roads starting in or crossing a settlement, material used to build those roads and their current status (i.e., paved, unpaved), characteristics of the roads and rivers meeting a relationship (i.e., they cross, touch) in the zone, rivers near to settlements that in case of huge pluvial concentration might represent a potential risk. 137

226 Use-case: Popocatépetl To illustrate the capabilities of our model for modeling and mining spatial data we will use as our study zone that shown in Figure 6.15 (Southwest side of the volcano crater). The experiment is focused on discovering characteristics among roads, rivers, and settlements in the zone. The presentation of the generated results will be structured in the following way: We will present discovered patterns among roads and rivers, roads and settlements, and rivers and settlements in the study zone. We chose this structure because we wanted to illustrate the user s capabilities to organize the data in order to create different sceneries to represent and mine spatial data. Figure Popocatépetl volcano: study zone. Therefore, we used our prototype system to select the river, settlement, and road spatial layers. The next step was to select the spatial relationships to be validated among the spatial 138

227 objects under consideration. To develop our test, we took advantage of a special feature implemented in the Spatial Oracle module (our SDBMS system): Oracle has the capability to examine two geometry objects to determine their topological spatial relationship. Moreover, it is possible to indicate that the same SDBMS determines and returns the topological relationship that best matches the geometries. So, to create our dataset, we first selected the spatial layers to work with and then we evaluated the relationships among the spatial objects contained in the spatial layers. The experiments were implemented based on the following parameters: Spatial layers: rivers, settlements and roads. Spatial relation: topological relations supported by Oracle Spatial. Spatial graph-based model: all models. The experiment was performed using the five proposed graph-based models. The objective was to evaluate the results generated by each of them. Additionally, we ran the test using no overlap, standard overlap and limited overlap features. In the following figures we show the generated results in the experiment. In the figures the generated result by using no overlap is labeled as a), via standard overlap is labeled as b), and through limited overlap is labeled as c). In the case of limited overlap, we told Subdue to allow overlap only for vertices representing roads in the zone because this element represents a primary item in our study domain (evaluation and implementation of population evacuation plans). 139

228 Since our intention in the experiment was to compare the generated results using the proposed graph-based models, we selected (and reported) as the most significant discovered pattern (by using no overlap, standard overlap and limited overlap), the one covering the following restrictions: Complete pattern. A pattern reporting at least two spatial objects (i.e., road and river), the spatial relation among them (i.e., touch), and some non-spatial attribute(s) (i.e., road category unpaved and river category draining ). Maximum number of reported instances. A pattern with the highest score of reported instances of a substructure. The following subsections present the generated results using model 1 to 5. By each model we describe the discovered patterns according to the proposed structure to mine data: road and river, road and settlement, and finally river and settlement. At the end of Section 6.2.2, we present comparison tables and conclusions of the generated results by each model Model #1 - base model Road and River. Figure 6.16 shows the most significant discovered pattern between roads and rivers. The pattern describes a relationship among road category unpaved overlapping a river category draining in the zone. This pattern may be considered as an indicator of the number of roads that need to be supervised in case of a volcanic contingence since the material type they are built with, and because they cross rivers (the lecture may be done in inverse order) that in case of huge pluvial concentration may overflow and make roads useless. Subdue found by using no overlap 46 instances (in the second iteration) of the 140

229 pattern; via standard overlap it found 85 instances (in the first iteration), and through limited overlap it also found 85 instances (in the second iteration). As we can see in the figure, standard and limited overlap found the same number of instances of the pattern, but limited overlap required two iterations to find the same pattern. However, this fact does not mean that standard overlap is better than limited overlap because analyzing the overall processing time required by limited overlap to finish the substructure discovery phase we note that it was lower than the required by standard overlap. 141

230 a) b) c) Figure Relationships among roads and rivers by using model #1. Road and Settlement. The most significant discovered pattern found in this experiment describes a relationship among road category unpaved touching a settlement category construction in the zone as shown in Figure Settlement category construction represents in the Popocatépetl s settlement spatial layer inhabit areas with huge population, buildings and several constructions used to offer services to people. If we assume that 142

231 people may require to be evacuated in case of an eruption and that the roads that will be used are unpaved then this situation may become a problem (i.e., a bottleneck, water and soil may become mud and this may make roads useless). For this experiment Subdue found via no overlap 6 instances (in the ninth iteration) of the pattern, through standard overlap it found 9 instances (in the fourth iteration), and by using limited overlap it discovered 8 instances (in the tenth iteration). In all cases Subdue was able to discover the same pattern; the difference was the number of computed iterations required to discover it. 143

232 a) b) c) Figure Relationships among roads and settlements by using model #1. River and Settlement. Figure 6.18 shows the most significant discovered pattern for these spatial objects. It describes a relationship among river category draining crossing a settlement category either (a) block or (b)(c) construction in the zone. Settlement category block represents in the Popocatépetl s settlement spatial layer inhabit areas but with small population, in fact there exist several uninhabited areas, few buildings and 144

233 constructions. The pattern may be used to identify potential flooding zones because it represents rivers close to (may be some of them crossing) areas inhabited by people. Through no overlap Subdue discovered 5 instances (in the twelfth iteration) of the pattern, by using standard overlap it also found 5 instances (in the eighth iteration), and finally, via limited overlap it also found 5 instances (in the eighth iteration). For this experiment Subdue discovered the same pattern in the three cases, however, by using standard overlap and limited overlap the understanding of the pattern is simpler. 145

234 a) b) c) Figure Relationships among rivers and settlements by using model #1. The previous experiment was done using model 1, now we perform the same experiment using model 2 to 5. We will be focused to compare the same discovered patterns for the three object-object structures (i.e., road-river, road-settlement, and river-settlement). The 146

235 generated results are shown in the figures following the same report schema: those using no overlap are labeled as a), those using standard overlap are labeled as b), and finally those created with limited overlap are labeled as c). The most significant differences between the generated results are the number of reported instances and the number of iterations needed to discover the pattern. More iterations means more processing time to discover the pattern Model #2 - single replication of relation types, complete information Road and River. By means of model #2 Subdue discovered the pattern (among road and river) shown in Figure Subdue found by using no overlap 41 instances (in the second iteration) of the pattern, via standard overlap it found 85 (in the third iteration), and through limited overlap it discovered 64 (in the second iteration). All cases reported road category unpaved and river category draining as the predominant spatial objects. 147

236 a) b) c) Figure Relationships among roads and rivers by using model #2. Road and Settlement. For these spatial objects Subdue was able to discover, by means of model #2, the pattern shown in Figure Via no overlap Subdue found 5 instances (in 148

237 the fourteenth iteration) of the pattern, through standard overlap it discovered 8 (in the sixth iteration), and by using limited overlap it found 7 (in the tenth iteration). No overlap reported settlement category block whereas standard and limited overlap reported more instances of settlement category construction. a) b) c) Figure Relationships among roads and settlements by using model #2. 149

238 River and Settlement. The pattern discovered by means of model #2 among these objects is shown in Figure Through no overlap Subdue found 5 instances (in the sixteenth iteration) of the pattern, by using standard overlap it discovered 10 (in the seventh iteration), and via limited overlap it found 5 (in the fourteenth iteration). No overlap and limited overlap reported settlement category construction whereas limited overlap reported more instances of settlement category block. a) b) c) Figure Relationships among rivers and settlements by using model #2. 150

239 Model #3 - double replication of relation types, no complete information Road and River. The pattern discovered, by means of model #3, for these objects is shown in Figure Through no overlap Subdue found 39 instances (in the second iteration) of the pattern, by using standard overlap it found 85 (in the second iteration), and via limited overlap it discovered 34 (in the ninth iteration). In all cases Subdue reported as the predominant spatial objects road category unpaved and river category draining. 151

240 a) b) c) Figure Relationships among roads and rivers by using model #3. 152

241 Road and Settlement. By means of model #3 Subdue discovered the pattern shown in Figure Subdue found by using no overlap 4 instances (in the fifteenth iteration) of the pattern, via standard overlap it found 8 (in the sixth iteration), and through limited overlap it discovered 5 (in the thirteenth iteration). No overlap and limited overlap reported settlement category block as the predominant spatial object whereas standard overlap reported settlement category construction. a) b) c) Figure Relationships among roads and settlements by using model #3. 153

242 River and Settlement. For these spatial objects Subdue was able to discover, by means of model #3, the pattern shown in Figure Via no overlap Subdue found 5 instances (in the twelfth iteration) of the pattern, through standard overlap it found 19 (in the fourth iteration), and by using limited overlap it found 5 (in the tenth iteration). No overlap and limited overlap reported less instances of settlement category construction whereas standard overlap reported more instances of settlement category block. a) b) c) Figure Relationships among rivers and settlements by using model #3. 154

243 Model #4 - single replication of relation types, no complete information Road and River. For these spatial objects Subdue was able to discover, by means of model #4, the pattern shown in Figure Via no overlap Subdue found 39 instances (in the second iteration) of the pattern, through standard overlap it discovered 85 (in the first iteration), and by using limited overlap it found 60 (in the second iteration). All cases reported as the predominant spatial objects road category unpaved and river category draining. 155

244 a) b) c) Figure Relationships among roads and rivers by using model #4. Road and Settlement. The pattern discovered, by means of model #4, for these objects is shown in Figure Through no overlap Subdue discovered 6 instances (in the twelfth iteration) of the pattern, by using standard overlap it found 8 (in the tenth iteration), and via 156

245 limited overlap it found 8 (in the seventh iteration). The representative spatial objects are road category unpaved and settlement category construction in all cases. a) b) c) Figure Relationships among roads and settlements by using model #4. River and Settlement. By means of model #4 Subdue discovered the pattern shown in Figure Subdue found by using no overlap 5 instances (in the sixth iteration) of the 157

246 pattern, via standard overlap it found 10 (in the sixth iteration), and through limited overlap it found 5 (in the eleventh iteration). No overlap and limited overlap reported less instances of settlement category construction whereas standard overlap reported more instances of settlement category block. a) b) c) Figure Relationships among rivers and settlements by using model #4. 158

247 Model #5 - double replication of relation types, complete information Road and River. The pattern discovered, by means of model #5, for these objects is shown in Figure Through no overlap Subdue discovered 45 instances (in the fifth iteration) of the pattern, by using standard overlap it found 85 (in the first iteration), and via limited overlap it found 45 (in the fifth iteration). Subdue reported in all cases as the representative spatial objects road category unpaved and river category draining. 159

248 a) b) c) Figure Relationships among roads and rivers by using model #5. 160

249 Road and Settlement. By means of model #5 Subdue discovered the pattern shown in Figure Subdue found by using no overlap 6 instances (in the seventh iteration) of the pattern, via standard overlap Subdue could not find a complete pattern (in the figure the category of the settlement is not reported), and through limited overlap it found 7 (in the seventh iteration). No overlap and limited overlap reported as the representative spatial objects road category unpaved and settlement category construction. a) b) c) Figure Relationships among roads and settlements by using model #5. 161

250 River and Settlement. For these spatial objects Subdue was able to discover, by means of model #5, the pattern shown in Figure Via no overlap Subdue discovered 5 instances (in the thirteenth iteration) of the pattern, through standard overlap it found 5 (in the sixth iteration), and by using limited overlap it also found 5 (in the tenth iteration). Subdue reported in all cases river category draining and settlement category construction as the predominant spatial objects. a) b) c) Figure Relationships among rivers and settlements by using model #5. 162

251 Table 6.1 presents a comparison, by each model and overlap feature, among the number of discovered instances (patterns) and the number of iterations needs to discover them. For example, by using model #1, Subdue found via no overlap 46 instances (in the second iteration) of a complete pattern (according to our definition for reporting a complete pattern) involving the spatial objects road-river. A higher score means a model allowing us to discover more instances of a substructure. Remember Subdue reported as the best pattern (by iteration) a substructure with the highest score of discovered instances of that substructure. This comparison is reported by each object-object structure (i.e., roadriver). Note: NO (No overlap), SO (Standard overlap), LO (Limited overlap). Model #1 Model #2 Model #3 Model #4 Model #5 NO SO LO NO SO LO NO SO LO NO SO LO NO SO LO Road-River Instances Iterations Road-Settlement Instances Iterations River-Settlement Instances Iterations Table 6.1. Instances/iterations by each graph-based model: Popocatépetl use-case. 163

252 From the previous table we can see that Subdue (setting no overlap -NO- in all cases) reported using model #1, 46 instances in the second iteration, of a complete pattern among the road and river spatial objects. Using model #2, Subdue reported 41 instances of a complete pattern in the second iteration. Using model #3, Subdue reported 30 instances of a complete pattern in the second iteration and so on. Finally, we can identify that model #1 was the best model to find the highest number of a complete pattern among these spatial objects (shown in pink color in Table 6.2). Model #1 Model #2 Model #3 Model #4 Model #5 NO NO NO NO NO Road-River Instances Iterations Table 6.2. The best model to discover complete patterns among the road and river spatial objects (no overlap) From Table 6.1 we created Table 6.3. This table presents a comparison of maximum/minimum discovered instances by each overlap feature. A model with the highest score is better since it allows discovering more instances of a substructure (patterns). The comparison is presented for each object-object structure (i.e., road-river). For example, we have already mentioned that model #1 was the best model to find complete patterns (setting no overlap) among the road and river spatial objects. Subdue reported 46 discovered instances of a complete pattern in the second iteration (the highest score). See Table 6.1 for details. 164

253 On the other hand, model #3 and model #4 were the worst models to find complete patterns among the road and river spatial objects. For example, using model #3, Subdue only reported 39 instances of a complete pattern in the second iteration. Maximum Minimum Road-River No overlap model #1 (second iteration) models #3 and #4 (second iteration) Standard overlap models #1, #4 and #5 (first iteration) model #2 (third iteration) Limited overlap model #1 (second iteration) model #3 (ninth iteration) Road-Settlement No overlap model #5 (seventh iteration) model #3 (fifteenth iteration) Standard overlap model #1 (fourth iteration) model #5 (no complete pattern) Limited overlap model #4 (seventh iteration) model #3 (thirteenth iteration) River-Settlement No overlap model #4 (sixth iteration) model #2 (sixteenth iteration). Standard overlap model #3 (fourth iteration) model #1 (eighth iteration). Limited overlap model #1 (eighth iteration) model #2 (fourteenth iteration) Table 6.3. Max/Min of discovered instances by object-object /overlap feature. From Table 6.3 we can identify that model #1 was the best model to discover complete patterns in most of the cases. The second best most was model #4. On the other hand, the worst model to discover complete patterns was model #3. Table 6.4 presents a comparison among the average of discovered instances by model. Higher score means a model allowing us to discover more instances of a substructure (patterns). Each value represents the average of discovered substructures by using no 165

254 overlap, standard overlap and limited overlap features. For example, the value 72.0 in column Model #1 and row Road-River is the average computed from the values 46, 85 and 85 obtained from Table 6.1. The comparison is reported by each object-object structure (i.e., road-river). We can see in the table that model #1 was the best model to find complete patterns. It was followed by model #2. The worst model was model #5. Model #1 Model #2 Model #3 Model #4 Model #5 Road-River Road-Settlement River-Settlement Table 6.4. Average of discovered instances by model/ object-object. Table 6.5 presents a comparison among the average of discovered instances by model. For example, the value 19.0 in column Model #1 no overlap (NO) is the average computed from the values 46, 6 and 5 obtained from Table 6.1. Higher score means a model allowing us to discover more instances of a substructure (patterns). The comparison is reported by each overlap feature. Model #1 Model #2 Model #3 Model #4 Model #5 NO SO LO NO SO LO NO SO LO NO SO LO NO SO LO Table 6.5. Average of discovered instances by model/overlap feature. Finally, Table 6.6 presents a comparison among the average of discovered instances by model. For example, the value 28.2 in column Model #1 is the average computed from the values 19.0, 33.0 and 32.7 obtained from Table 6.5. We can see in the table that model 166

255 #1 reported the highest score of discovered instances (according to our parameters for reporting complete instances) in this illustrative Popocatépetl domain. The following ones were model #2 and model #4 respectively. The worst model to find complete patterns as proposed was model #5. Model #1 Model #2 Model #3 Model #4 Model # Table 6.6. Average of discovered instances by model. As part of our strategy for testing our graph-based methodology for modeling and mining spatial data, we proposed five graph-based operative models derived from our general schema. However, from the analysis and interpretation of the results presented in this section we can conclude that model #1 and model #2 are (following that order) the two graph-based models that produce the best results. This conclusion is based on the quantitative and qualitative analysis implemented using this test domain. Additionally, this appreciation is also supported by the results obtained from the discovered patterns in the population census from the year of 1777 in Puebla downtown test domain described in the previous section. 6.3 Conclusion In this chapter we have presented three illustrative use-cases of our proposal for modeling and mining spatial data. The test domains were two spatial databases, the first one related to 167

256 a population census from the year of 1777 in Puebla downtown, and the second one related to the Popocatépetl volcano. The use-cases developed using the population census database were focused on exemplifying the graph-based model to represent together spatial data, non-spatial data and spatial relations, the new limited overlap feature implemented in the Subdue system (processing time and specialized overlapping pattern oriented search), and the generated results by the mining phase. We have presented in each use-case evaluations performed by the domain expert over the discovered patterns. The use-case developed using the Popocatépetl database was focused on the comparison/evaluation of the generated results by each proposed graph-based model. The tests were implemented upon the supposition of evacuation plans implementation in case of volcanic contingences. We presented comparison tables describing which model(s) allow(s) to discover more instances of a substructure via no overlap, standard overlap, and limited overlap features. Subdue reported as the best pattern (by iteration) a substructure with the highest score of discovered instances of that substructure. Next chapter presents conclusion about our research work and final remarks. 168

257 Chapter 7 CONCLUSIONS The continuous interaction among people and their natural home, the Planet Earth, generates everyday new requirements associated to spatial data. For example, urban analysis, natural risks prevention, space exploration, contamination in oceans, and reforestation of lands just to mention some of them. Spatial data mining involves the integration of methods from different scientific fields which help us, by means of data analysis and discovery algorithms, to produce a particular enumeration of patterns from spatial data. In Chapter 2 we presented several approaches developed for mining spatial data (i.e., generalization-based methods, clustering, spatial associations, approximation and aggregation, mining in image databases, spatial classification, and spatial trend detection). However, our argumentation about those approaches was that they do not consider all the elements found in a spatial database (spatial data, non-spatial data and spatial relations among the spatial objects) in an extended way. We proposed in this dissertation a new approach based on graphs. Our feeling is that if we are able to represent those elements as a unique dataset and if we are also able to mine them as a whole, then, we might be able to 169

258 get patterns that might contain both types of data and spatial relationships enhancing the quality of the results since the generated pattern(s) would describe (a) spatial object(s) meeting (a) spatial relation(s) with other spatial object(s) and which is(are) that (those) relationship(s). We proposed to use a graph-based representation since it provides the desirable flexibility to describe these elements and their relations together. As mentioned, in our model spatial relations among spatial objects are included since a significant characteristic of spatial data is the influence of the neighbors of an object may have on the object itself. In the model we included the representation of three types of spatial relations. Derived from the general graph-based schema we have proposed five operative models. Three aspects define the characteristics of a graph created from those models: first, the representation of equivalent spatial relations (the relation A touch B can be represented by two directed edges, A B and B A, or by one undirected edge, A B; we used the second approach). Second, the representation of symmetric spatial relations (the relation A North_of B implies a relation B South_of A, some models represent only the first relation and other both relations). The third aspect is the way to represent the objects and their relationships. Our intention is to represent the spatial objects and their relation as much as possible but also considering a balance among the representative of the data and the complexity of the created graph. This last issue has a major importance since the tie among the complexity of the created graph and the mining phase. For example, huge graphs may require more computational resources than small graphs, but by creating small graphs we may loss data representativeness and perhaps we may not to discover significant patterns. 170

259 As component of our methodology for mining spatial data using a graph-based approach, we used the Subdue system as our mining tool. The overlap feature plays a relevant role in the Subdue s substructure discovering system. But, as we described, it is implemented in an orthodox way: allows overlap among any instances of a substructure or allows overlap among all instances of a substructure. The first option is better regarding to the processing time, but the second one may discover more instances of a substructure (patterns). Both cases do not give to the user the capability to specify the set of vertices allowed for overlapping. Therefore, we proposed a new overlap approach named limited overlap. The new approach gives to the user the means to specify over which vertices the overlap will be allowed. These vertices may represent significant elements in the context we work with. For example, in the use-case presented in Section 6.2 (implementation of evacuation plans in case of volcanic contingences in the Popocatépetl volcano zone) vertices representing roads were allowed for overlapping since these spatial objects represent relevant elements in the evacuation plans. Moreover, we visualized three motivation issues to propose the implementation of the new approach. First, we demonstrated that by using limited overlap we obtain a search space reduction in the substructure discovering process since we allow overlap but it is restricted to the set of elements chosen by the user. Second, as result of a search space reduction we also get a processing time reduction. Remember that as part of the substructure discovery process there exist a validation and discarding phase over the instances of a substructure. 171

260 Instances that are not discarded may become candidates to further expansion in order to discover new substructures. Third, giving the user the capability to choose the set of vertices allowed for overlapping, we are orienting the search over particular overlapping instances and, at the same time, discarding irrelevant overlapping instances. In order to show the feasibility, capacity to mine and to discover patterns by using a graphbased approach as proposed, we developed a prototype system implementing our model to create graph-based datasets, to mine those datasets (by calling the Subdue system), and to visualize the discovered patterns. In Chapter 6 we described three illustrative use-cases showing the applicability of our proposal. We used two real world domains: a population census from the year of 1777 in Puebla downtown, and a Popocatépetl volcano database. The generated results from those test domains give us a panorama about how and what we can achieve using our approach. It is important to remark the fact that we can use this approach in any domain that can be represented as a graph. In this context, we visualize perspectives related to enhance our work in issues such as the graph-based model for creating the graphs, the data mining algorithm, and the prototype system. They include some of the following: Visualization/presentation of discovered knowledge. For example, visualization of discovered knowledge over the spatial layers (patterns over the maps); to use charts for 172

261 showing comparison among patterns; give the user the capability to navigate through the discovered patterns hierarchy using a hypergraph approach. Enhancing the algorithms used to create the graph-based datasets according to the proposed models. The validation of the spatial relations among the spatial object is phase that in most of the cases requires a lot of computational resources. (i.e., creation and manage of spatial indices, creation of bounding box for the spatial objects, manage of spatial reference systems, etc.) Therefore, the algorithms must as most as possible to execute efficiently this task. Mining the graph. We proposed and used the Subdue system as our data mining tool, moreover we implemented a new algorithm name limited overlap. Subgraph isomorphism is an NP-complete problem, so we must be able to face this problem in order that our processing times for discovering knowledge meet acceptable parameters of efficiency. Relationships among non-spatial data describing spatial objects. Implicit relations among non-spatial data (i.e., attributes) describing the spatial objects may be included in the model in order to enhance the representativeness of the data. Spatial data mining is a promising research field. Several approaches have been developed, and without any doubt, new approaches will be proposed; it is a field in continuous improvement. Our contribution to the spatial data mining goes in that direction. 173

262 BIBLIOGRAPHY [1] AGRAWAL Rakesh, IMIELINSKI Tomasz, SWAMI Arun. Mining Association Rules between Sets of Items in Large Databases. In : Proc. of the 1993 International Conference on Management of Data, Washington D.C.. New York, NY : Association for Computing Machinery Press, 1993, pp ISBN [2] BERNHARDSEN Tor. Geographic Information Systems: An Introduction. 2 nd ed. New York : John Wiley & Son Inc, 1999, 448 p. ISBN [3] CHEIN Michel, MUGNIER Marie-Laure. Conceptual Graphs: Fundamental Notions. Revue d intelligence artificielle, 1992, vol. 6, n o 4, pp [4] CHEN Ming-Syan, HAN Jiawei, YU Philip S. Data Mining: An Overview from a Database Perspective. Institute of Electrical and Electronics Engineers, Transactions on Knowledge and Data Engineering, 1996, vol. 8, n o 6, pp ISSN

263 [5] DIESTEL Reinhard. Graph Theory [on line]. 3 rd ed. New York : Springer-Verlag, Available on :< (last visit ). ISBN [6] EGENHOFER Max J. A Model for Detailed Binary Topological Relationships. Geomatica, 1993, vol. 47, n o 3-4, pp [7] EGENHOFER Max J., FRANZOSA Robert D. On the Equivalence of Topological Relations. International Journal of Geographical Information Systems, 1995, vol. 9, n o 2, pp [8] EGENHOFER Max J., HERRING John R. Categorizing Binary Topological Relationships between Regions, Lines, and Points in Geographic Databases. Technical Report. Department of Surveying Engineering, University of Maine, Orono, 1991, 28 p. [9] ESTER Martin, FROMMELT Alexander, KRIEGEL Hans-Peter, SANDER Jörg. Spatial Data Mining: Database Primitives, Algorithms and Efficient DBMS Support. Data Mining and Knowledge Discovery, 2000, vol. 4, n o 2-3, pp ISSN [10] ESTER Martin, FROMMELT Alexander, KRIEGEL Hans-Peter, SANDER Jörg. Algorithms for Characterization and Trend Detection in Spatial Databases. In : Proc. of the 4 th International Conference on Knowledge Discovery and Data Mining, New York, NY, 1998, pp

264 [11] ESTER Martin, KRIEGEL Hans-Peter, SANDER Jörg. Spatial Data Mining: A Database Approach. In : Proc. of the 5 th International Symposium on Large Spatial Databases, Berlin, Germany, July 1997, pp [12] FAYYAD Usama M., PIATETSKY-SHAPIRO Gregory, SMYTH Padhraic. Knowledge Discovery and Data Mining: Towards a Unifying Framework. In : SIMOUDIS Evangelos, HAN Jiawei, FAYYAD Usama M. Eds. Proc. of the 2 nd International Conference on Knowledge Discovery and Data Mining, Portland, Oregon. Menlo Park, CA : American Association for Artificial Intelligence Press, 1996, pp [13] FAYYAD Usama M., PIATETSKY-SHAPIRO Gregory, SMYTH Padhraic, UTHURUSAMY Ramasamy Eds. Advances in Knowledge Discovery and Data Mining. Menlo Park, CA : American Association for Artificial Intelligence/Massachusetts Institute of Technology Press, 1996, 625 p. ISBN [14] FAYYAD Usama M., SMYTH Padhraic. Image Database Exploration: Progress and Challenges. In : Proc. of the 1993 Workshop on Knowledge Discovery in Databases. Washington D.C., July 1993, pp [15] FAYYAD Usama M., WEIR Nicholas, DJORGOVSKI S. SKICAT: A Machine Learning System for Automated Cataloging of Large Scale Sky Survey. In : UTGOFF Paul E. Eds. Proc. of the 10 th International Conference on Machine Learning, 1993, University of Massachusetts. San Mateo, CA : Morgan Kaufmann, 1993, pp

265 [16] FRAWLEY William J., PIATETSKY-SHAPIRO Gregory, MATHEUS Christopher J. Knowledge Discovery in Databases: An Overview. In : PIATETSKY-SHAPIRO Greogry, FRAWLEY William J. Knowledge Discovery in Databases. Menlo Park CA : American Association for Artificial Intelligence/Massachusetts Institute of Technology Press, 1991, pp [17] GONZALEZ Jesus A. Empirical and Theoretical Analysis of Relational Concept Learning Using a Graph-based Representation. PhD Thesis. Arlington : The University of Texas at Arlington, 2001, 138 p. [18] GONZALEZ Jesus A., HOLDER Lawrence B., COOK Diane J. Structural Knowledge Discovery Used to Analyze Earthquake Activity. In : Proc. of the 13 th Annual Florida Artificial Intelligence Research Symposium. American Association for Artificial Intelligence Press, 2000, pp [19] GRAPHVIZ - AT&T RESEARCH. Graphviz - Graph Visualization Software [on line]. Available on : < (last visit ). [20] GUTTMAN Antonin. R-Trees: A Dynamic Index Structure for Spatial Searching. In : Proc. of the 1984 International Conference on Management of Data, Boston, Massachusset, New York, NY : Association for Computing Machinery Press, June 1984, pp

266 [21] HAN Jiawei, CAI Yandong, CERCONE Nick. Knowledge Discovery in Databases: An Attribute-Oriented Approach. In : YUAN Li Yan, Proc. of the 18 th International Conference on Very Large Databases, Vancouver, British Columbia, Canada. San Mateo : Morgan Kaufmann Publishers, 1992, pp [22] HAN Jiawei, FU Yongjian. Exploration of the Power of Attribute-Oriented Induction in Data Mining. In : [13]. [23] HAN Jiawei, KAMBER Micheline. Data Mining: Concepts and Techniques. San Francisco : Morgan Kaufmann, 2000, 550 p. ISBN [24] HOLDER Lawrence B., COOK Diane J., GONZALEZ Jesus A., JONYER I. Structural Pattern Recognition in Graphs. In : CHEN Dechang, CHENG Xiuzhen Eds. Pattern Recognition and String Matching. Dordrecht : Kluwer Academic Publishers, 2002, 772 p. ISBN [25] HOLSHEIMER Marcel, KERSTEN Martin L. Architectural Support for Data Mining. CS-R9429. Amsterdam, The Netherlands : CWI (Centre for Mathematics and Computer Science), [26] KAUFMAN Leonard, ROUSSEEUW Peter J. Finding Groups in Data: An Introduction to Cluster Analysis. New York : Wiley-Interscience, 1990, 368 p. ISBN

267 [27] KNORR Edwin M., NG Raymond T. Finding Aggregate Proximity Relationships and Commonalities in Spatial Data Mining. Institute of Electrical and Electronics Engineers, Transactions on Knowledge and Data Engineering, 1996, vol. 8, n o 6, pp [28] KOLATCH Erica. Clustering Algorithms for Spatial Databases: A Survey. University of Maryland, College Park : Department of Computer Science, 2001, 22 p. [29] KOPERSKI Krzysztof, HAN Jiawei. Discovery of Spatial Association Rules in Geographic Information Databases. In : Proc. of the 4 th International Symposium on Large Spatial Databases, Portland, Maine, August 1995, pp [30] KOPERSKI Krzysztof, HAN Jiawei, ADHIKARY Junas. Spatial Data Mining: Progress and Challenges. In : Proc. of the Workshop on Research Issues on Data Mining and Knowledge Discovery, Montreal, Canada, June 1996, pp [31] KOPERSKI Krzysztof, HAN Jiawei, STEFANOVIC Nebojsa. An Efficient Two- Step Method for Classification of Spatial Data. In : Proc. of the 8 th Symposium on Spatial Data Handling, Vancouver, Canada, 1998, pp [32] LAURINI Robert. Information Systems for Urban Planning: A Hypermedia Cooperative Approach. London : Taylor and Francis, 2001, 368 p. ISBN

268 [33] LAURINI Robert, MILLERET-RAFFORT F. Les bases de données en géomatique. Paris : HERMES, 1993, 330 p. [34] LAURINI Robert, THOMPSON Derek. Fundamentals of Spatial Information Systems. London : Academic Press, 1992, 680 p. (The A.P.I.C. series, n o 37) ISBN [35] LU Wei, HAN Jiawei, OOI Beng C. Discovery of General Knowledge in Large Spatial Databases. In : Proc. of the Far East Workshop on Geographic Information Systems, Singapore, June 1993, pp [36] MUGNIER Marie-Laure. On Generalization/Specialization for Conceptual Graphs. Journal of Experimental and Theoretical Artificial Intelligence, 1995, vol. 7 pp [37] NG Raymond T., HAN Jiawei. Efficient and Effective Clustering Methods for Spatial Data Mining. In : Proc. of the 20 th Very Large Databases Conference, Santiago, Chile, San Francisco, CA, 1994, pp [38] PECH PALACIO Manuel, SOL David, GONZALEZ Jesus A. Adaptation and Use of Spatial and non-spatial Data Mining. Proc. of the International 2002 Workshop Semantic Processing of Spatial Data, Centre for Computing Research, Instituto Politécnico Nacional, México D.F., December ISBN X 180

269 [39] PREPARATA Franco P., SHAMOS Michael Ian. Computacional Geometry: An Introduction. 1 st ed. New York, NY : Springer-Verlag, p. ISBN [40] SHEIKHOLESLAMI Gholamhosein, CHATTERJEE Surojit, ZHANG Aidong. WaveCluster: A Multi-Resolution Clustering Approach for Very Large Spatial Databases. In : GUPTA Ashish, SHMUELI Oded, WIDOM Jennifer Eds. Proc. of the 24 th International Conference on Very Large Databases, New York, NY, San Francisco, CA : Morgan Kaufmann Publishers, 1998, pp ISBN [41] SHEK Eddie C., MUNTZ Richard R., MESROBIAN Edmond, NG Kenneth. Scalable Exploratory Data Mining of Distributed Geoscientific Data. In : Proc. of the 2 nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, Menlo Park, CA : American Association for Artificial Intelligence Press, 1996, pp [42] SOL David, LOYO E., RAZO Antonio. Risk Management in the Popocatépetl Volcano Zone. In : Workshop on Advanced Techniques for the Assessment of Natural Hazards in Mountain Areas, Innsbruck, Austria, June 2000, pp [43] STOLORZ Paul E, DEAN Christopher. Quakefinder: A Scalable Data Mining System for Detecting Earthquakes from Space. In : Proc. of the 2 nd International Conference on Data Mining, Portland, Oregon, Menlo Park, CA : American Association for Artificial Intelligence Press, 1996, pp

270 [44] STOLORZ Paul E., NAKAMURA H., MESROBIAN Edmond, MUNTZ Richard R., SHEK Eddie C., SANTOS J. R., YI J., Ng K., CHIEN S. Y., MECHOSO C. R., FARRARA J. D. Fast Spatio-Temporal Data Mining of Large Geophysical Datasets. In : Proc. of the 1 st International Conference on Data Mining, Montreal, Canada, Menlo Park, CA : American Association for Artificial Intelligence Press, 1995, pp [45] SUBDUE SYSTEM - THE UNIVERSITY OF TEXAS AT ARLINGTON. SUBDUE - Graph Based Knowledge Discovery [on line]. Available on : < (last visit ). [46] ZHANG Tian, RAMAKRISHNAN Raghu, LIVNY Miron. BIRCH: An Efficient Data Clustering Method for Very Large Databases. In : Proc. of the 1996 International Conference on Management of Data, Montreal, Canada, New York, NY : Association for Computing Machinery Press, June 1996, pp ISSN

271 Appendix C AGREEMENTS UDLAP/INSA DE LYON lxix

272 lxx

273 lxxi

274 lxxii

275 lxxiii

276 FOLIO ADMINISTRATIF THESE SOUTENUE DEVANT L'INSTITUT NATIONAL DES SCIENCES APPLIQUEES DE LYON NOM : PECH PALACIO DATE de SOUTENANCE : 12/12/2005 (avec précision du nom de jeune fille, le cas échéant) Prénoms : Manuel Alfredo TITRE : Spatial Data Modeling and Mining using a Graph-based Representation NATURE : Doctorat Numéro d'ordre : 05 ISAL Ecole doctorale : Informatique et Information pour la Société Spécialité : Documents Multimédia, Images et Systèmes d'information Cote B.I.U. - Lyon : T 50/210/19 / et bis CLASSE : RESUME : We propose a unique graph-based model to represent spatial data, non-spatial data and the spatial relations among spatial objects. We will generate datasets composed of graphs with a set of these three elements. We consider that by mining a dataset with these characteristics a graph-based mining tool can search patterns involving all these elements at the same time improving the results of the spatial analysis task. A significant characteristic of spatial data is that the attributes of the neighbors of an object may have an influence on the object itself. So, we propose to include in the model three relationship types (topological, orientation, and distance relations). In the model the spatial data (i.e. spatial objects), non-spatial data (i.e. non-spatial attributes), and spatial relations are represented as a collection of one or more directed graphs. A directed graph contains a collection of vertices and edges representing all these elements. Vertices represent either spatial objects, spatial relations between two spatial objects (binary relation), or non-spatial attributes describing the spatial objects. Edges represent a link between two vertices of any type. According to the type of vertices that an edge joins, it can represent either an attribute name or a spatial relation name. The attribute name can refer to a spatial object or a non-spatial entity. We use directed edges to represent directional information of relations among elements (i.e. object x touches object y) and to describe attributes about objects (i.e. object x has attribute z). We propose to adopt the Subdue system, a general graph-based data mining system developed at the University of Texas at Arlington, as our mining tool. A special feature named overlap has a primary role in the substructures discovery process and consequently a direct impact over the generated results. However, it is currently implemented in an orthodox way: all or nothing. Therefore, we propose a third approach: limited overlap, which gives the user the capability to set over which vertices the overlap will be allowed. We visualize directly three motivations issues to propose the implementation of the new algorithm: search space reduction, processing time reduction, and specialized overlapping pattern oriented search. MOTS-CLES : KDD, Spatial data mining, Graph, Spatial data, Spatial relation, Geoprocessing Laboratoire (s) de recherche : LIRIS - Laboratoire d'informatique en Images et Systèmes d'information / CENTIA-UDLAP, Mexique Directeur de thèse: Robert LAURINI, Professeur, INSA de Lyon, France, Directeur Président de jury : Eduardo MORALES, Profesor Titular ITESM-Morelos, Mexique, Examinateur Composition du jury : Nicandro CRUZ François FAGES Jesús GONZALEZ Robert LAURINI Hervé MARTIN Eduardo MORALES David SOL Anne TCHOUNIKINE Profesor Titular, Universidad Veracruzana, Mexique, Rapporteur Directeur de Recherches, INRIA, Paris, France, Examinateur Profesor Titular INAOE, Mexique, Co-Directeur Professeur, INSA de Lyon, France, Directeur Professeur Université Joseph Fourier, Grenoble, France, Rapporteur Profesor Titular ITESM-Morelos, Mexique, Examinateur Profesor Titular UDLAP, Mexique, Directeur Maître de Conférences, INSA de Lyon, France, Co-Directeur

Montrer encore