Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 2

Documents pareils

APPENDIX 6 BONUS RING FORMAT

Arithmetical properties of idempotents in group algebras

Exercices sur SQL server 2000

Le passé composé. C'est le passé! Tout ça c'est du passé! That's the past! All that's in the past!

Exemple PLS avec SAS

Instructions pour mettre à jour un HFFv2 v1.x.yy v2.0.00

Instructions Mozilla Thunderbird Page 1

Improving the breakdown of the Central Credit Register data by category of enterprises

INSERTION TECHNIQUES FOR JOB SHOP SCHEDULING

MELTING POTES, LA SECTION INTERNATIONALE DU BELLASSO (Association étudiante de lʼensaparis-belleville) PRESENTE :

Small Businesses support Senator Ringuette s bill to limit credit card acceptance fees

Contrôle d'accès Access control. Notice technique / Technical Manual

1.The pronouns me, te, nous, and vous are object pronouns.

Cette Leçon va remplir ces attentes spécifiques du curriculum :

GAME CONTENTS CONTENU DU JEU OBJECT OF THE GAME BUT DU JEU

INVESTMENT REGULATIONS R In force October 1, RÈGLEMENT SUR LES INVESTISSEMENTS R En vigueur le 1 er octobre 2001

IPSAS 32 «Service concession arrangements» (SCA) Marie-Pierre Cordier Baudouin Griton, IPSAS Board

Contents Windows

0 h(s)ds et h [t = 1 [t, [ h, t IR +. Φ L 2 (IR + ) Φ sur U par

CLIM/GTP/27/8 ANNEX III/ANNEXE III. Category 1 New indications/ 1 re catégorie Nouvelles indications

Application Form/ Formulaire de demande

Support Orders and Support Provisions (Banks and Authorized Foreign Banks) Regulations

Data issues in species monitoring: where are the traps?

iqtool - Outil e-learning innovateur pour enseigner la Gestion de Qualité au niveau BAC+2

Surveillance de Scripts LUA et de réception d EVENT. avec LoriotPro Extended & Broadcast Edition

Cheque Holding Policy Disclosure (Banks) Regulations. Règlement sur la communication de la politique de retenue de chèques (banques) CONSOLIDATION

Practice Direction. Class Proceedings

Lamia Oukid, Ounas Asfari, Fadila Bentayeb, Nadjia Benblidia, Omar Boussaid. 14 Juin 2013

AMENDMENT TO BILL 32 AMENDEMENT AU PROJET DE LOI 32

RULE 5 - SERVICE OF DOCUMENTS RÈGLE 5 SIGNIFICATION DE DOCUMENTS. Rule 5 / Règle 5

TP: Représentation des signaux binaires. 1 Simulation d un message binaire - Codage en ligne

DOCUMENTATION MODULE BLOCKCATEGORIESCUSTOM Module crée par Prestacrea - Version : 2.0

Paxton. ins Net2 desktop reader USB

First Nations Assessment Inspection Regulations. Règlement sur l inspection aux fins d évaluation foncière des premières nations CONSOLIDATION

Eléments de statistique

Bigdata et Web sémantique. les données + l intelligence= la solution

Credit Note and Debit Note Information (GST/ HST) Regulations

Dans une agence de location immobilière...

calls.paris-neuroscience.fr Tutoriel pour Candidatures en ligne *** Online Applications Tutorial

Tarification et optimisation pour le marketing

The new consumables catalogue from Medisoft is now updated. Please discover this full overview of all our consumables available to you.

Notice Technique / Technical Manual

that the child(ren) was/were in need of protection under Part III of the Child and Family Services Act, and the court made an order on

Technologies quantiques & information quantique

UML : Unified Modeling Language

POLICY: FREE MILK PROGRAM CODE: CS-4

Railway Operating Certificate Regulations. Règlement sur les certificats d exploitation de chemin de fer CODIFICATION CONSOLIDATION

BILL 203 PROJET DE LOI 203

I. Programmation I. 1 Ecrire un programme en Scilab traduisant l organigramme montré ci-après (on pourra utiliser les annexes):

Project 1 Experimenting with Simple Network Management Tools. ping, traceout, and Wireshark (formerly Ethereal)

Compléter le formulaire «Demande de participation» et l envoyer aux bureaux de SGC* à l adresse suivante :

TARIFS PIETONS SAISON PEDESTRIAN RATES SEASON 2014/2015 TARIFS ASSURANCE CARRE NEIGE CARRE NEIGE INSURANCE RATES 2014/2015

LOGO. Module «Big Data» Extraction de Connaissances à partir de Données. Claudia MARINICA MCF, ETIS UCP/ENSEA/CNRS

INDIVIDUALS AND LEGAL ENTITIES: If the dividends have not been paid yet, you may be eligible for the simplified procedure.

Package Contents. System Requirements. Before You Begin

WINTER BOAT STORAGE SYSTEM SYSTÈME DE REMISAGE HIVERNAL POUR BATEAU

ETABLISSEMENT D ENSEIGNEMENT OU ORGANISME DE FORMATION / UNIVERSITY OR COLLEGE:

: Machines Production a créé dès 1995, le site internet

Stratégie DataCenters Société Générale Enjeux, objectifs et rôle d un partenaire comme Data4

General Import Permit No. 13 Beef and Veal for Personal Use. Licence générale d importation n O 13 bœuf et veau pour usage personnel CONSOLIDATION

Ordonnance sur le paiement à un enfant ou à une personne qui n est pas saine d esprit. Infant or Person of Unsound Mind Payment Order CODIFICATION

PARIS ROISSY CHARLES DE GAULLE

Utiliser une WebCam. Micro-ordinateurs, informations, idées, trucs et astuces

Face Recognition Performance: Man vs. Machine

Name Use (Affiliates of Banks or Bank Holding Companies) Regulations

EXISTENCE OF ALGEBRAIC HECKE CHARACTERS ( C. R. ACAD. SCI. PARIS SER. I MATH, 332(2001) ) Tonghai Yang

THE SUBJUNCTIVE MOOD. Twenty-nineth lesson Vingt-neuvième leçon

Règlement relatif à l examen fait conformément à la Déclaration canadienne des droits. Canadian Bill of Rights Examination Regulations CODIFICATION

Bill 69 Projet de loi 69

DOCUMENTATION - FRANCAIS... 2

Dis où ces gens vont d après les images / Tell where these people are going based on the pictures.

IDENTITÉ DE L ÉTUDIANT / APPLICANT INFORMATION

Comprendre l impact de l utilisation des réseaux sociaux en entreprise SYNTHESE DES RESULTATS : EUROPE ET FRANCE

PRACTICE DIRECTION ON THE LENGTH OF BRIEFS AND MOTIONS ON APPEAL

Critères à l attention des fabricants et des fournisseurs de biens ou de services : dispositifs mécaniques pour bingo

3615 SELFIE. HOW-TO / GUIDE D'UTILISATION

How to Login to Career Page

Algorithmes de recommandation, Cours Master 2, février 2011

BNP Paribas Personal Finance

RISK-BASED TRANSPORTATION PLANNING PRACTICE: OVERALL METIIODOLOGY AND A CASE EXAMPLE"' RESUME

LOI SUR LA RECONNAISSANCE DE L'ADOPTION SELON LES COUTUMES AUTOCHTONES ABORIGINAL CUSTOM ADOPTION RECOGNITION ACT

Big Data et Graphes : Quelques pistes de recherche

THE LAW SOCIETY OF UPPER CANADA BY-LAW 19 [HANDLING OF MONEY AND OTHER PROPERTY] MOTION TO BE MOVED AT THE MEETING OF CONVOCATION ON JANUARY 24, 2002

Modélisation géostatistique des débits le long des cours d eau.

Tex: The book of which I'm the author is an historical novel.

Ships Elevator Regulations. Règlement sur les ascenseurs de navires CODIFICATION CONSOLIDATION. C.R.C., c C.R.C., ch. 1482

Natixis Asset Management Response to the European Commission Green Paper on shadow banking

La Routine Quotidienne. Le docteur se lave les mains

DOCUMENTATION - FRANCAIS... 2

Folio Case User s Guide

Démonstration de la conjecture de Dumont

Deadline(s): Assignment: in week 8 of block C Exam: in week 7 (oral exam) and in the exam week (written exam) of block D

WEB page builder and server for SCADA applications usable from a WEB navigator

THE EVOLUTION OF CONTENT CONSUMPTION ON MOBILE AND TABLETS

Interest Rate for Customs Purposes Regulations. Règlement sur le taux d intérêt aux fins des douanes CONSOLIDATION CODIFICATION

Tier 1 / Tier 2 relations: Are the roles changing?

Stakeholder Feedback Form January 2013 Recirculation

Tammy: Something exceptional happened today. I met somebody legendary. Tex: Qui as-tu rencontré? Tex: Who did you meet?

Principe de TrueCrypt. Créer un volume pour TrueCrypt

valentin labelstar office Made-to-measure label design. Conception des étiquettes sur mesure. Quality. Tradition. Innovation DRUCKSYSTEME

Transcription:

Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 2 Lhouari Nourine 1 1 Université Blaise Pascal, CNRS, LIMOS, France SeqBio 2012 Marne la vallée, France 2. Financé par le programme DEFIS 2009 de l ANR Projet DAG

Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 3 2/24 2 / 24 Context : Mining interesting patterns in databases Plenty of contributions over the last 20 years 1 Patterns : itemsets, sequences, trees, graphs, functional dependencies, queries... 2 Databases : Relational DB, Transactional DB, XML DB... or just a collection of patterns (supposed to be large) 3 Interestingness criteria : frequency (and variants), satisfaction of some predicates Define a wide class of enumeration problems, some being studied for years in combinatorics, AI and databases Frequent itemset mining (FIM) : The most studied problem in data mining

3/24 3 / 24 Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 4 Preliminaries Plan 1 Preliminaries 2 3 Concluding remarks

4/24 4 / 24 Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 5 Preliminaries Notations (Mannila and Toivonen, DMKD, 1997) A pattern mining problem : L : set of patterns, a partial order on L. d : a database Q : a monotonic predicate to qualify interesting patterns X in d, noted Q(X,d). Several questions can be asked : 1 Frequent patterns 2 Closed frequent patterns 3 Positive and Negative borders denoted by Bd + () and Bd (). Dualization Relationship between the two borders

Maximal red elements is known as a Positive border 5/24 Minimal blue elements is known as a Negative border 5 / 24 Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 6 Preliminaries Dualization Hypothesis L is structured with a partial order. Q is anti-monotonic wrt : for all θ, ϕ L, ϕ θ implies Q(θ) Q(ϕ).

6/24 6 / 24 Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 7 Preliminaries Dualization Problem on Boolean Lattices Bd ({A, BC}) ={D, AB, AC}.

7/24 7 / 24 Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 8 Preliminaries (L, ) Isomorphic to a boolean lattice Basic ideas : A bijection f between patterns L and the powerset of some finite set R and Isomorphism between (L, ) and (2 R, ) RAS = The class of pattern mining problems for which a representation as sets exists

8/24 8 / 24 Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 9 Preliminaries (L, ) Isomorphic to a boolean lattice Results for RAS problems : [Mannila &Toivonen, DMKD, 1997] Dualization is equivalent to minimal transversal Hypergraph, All algorithms for itemset mining can be used for RAS Complexity depends to the computation of f 1. Main consequence : existence of incremental quasi-polynomial time algorithm for RAS [Gunopulos et al., TODS, 2003]

9/24 9 / 24 Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 10 Preliminaries Dualization Problem The complexity depends on the structural properties of the poset (L, ). (L, ) is isomorphic to a boolean lattice : quasi-polynomial (Fredman et Khachiyan 96). (L, ) is isomorphic to a product of chains : quasi-polynomial (Elbassioni et al 09) (L, )) is the set of basis of a matroid : polynomial (Elbassioni et al 09) (L, ) is isomorphic to a lattice : conp-complete (Babin et Kuznetsov 11). (L, ) is a distributive lattice : OPEN.

10/24 10 / 24 Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 11 Plan 1 Preliminaries 2 3 Concluding remarks

11/24 11 / 24 Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 12 Limits of RAS (1) The surjectivity constraint the number of patterns has to be equal to 2 n, very unlikely in practice (2) The embedding preserves comparability General Posets : it is not always possible preserve borders

12/24 12 / 24 Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 13 Sequence with wildcards Let Σ be an alphabet and Σ be a wildcard. An input sequence, S Σ of length n 0. A rigid pattern is a string P (Σ { }) of length m n such that P[1] and P[m]. For patterns P[1..m] and Q[1..n], we say that P occurs in Q at position p [1..n], denoted by P p Q, if for every i [1..m] P[i] = Q[p + i 1] or P[i] =.

13/24 13 / 24 Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 14 The poset of rigid patterns Let L S be the set of all rigid motifs over Σ. We have (L S, ) a partial order. Bd ({ab}) = {aa, bb, ba, a a, a b, b a, b b, a a, a b, b a, b b}

14/24 14 / 24 Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 15 The poset of rigid patterns (1) (L S, ) is not isomorphic to a boolean lattice (1) (L S, ) cannot be embedded in a boolean lattice such that borders are preserved Dualisation of sequences does not belong to RAS Mapping which does not preserve comparability?

5/24 15 / 24 Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 16 The encoding of Arimura 2009 Let S be a sequence of size n on Σ. The embedding is as follows : Let R = {(i, x) i [1..n], x Σ}. Let f : L S 2R with f (P[1..m]) = {(i, P[i]) i [1..m] and P[i] Σ} For Σ = {a, b} and n = 4, we have R = {(1, a), (2, a), (3, a), (4, a), (1, b), (2, b), (3, b), (4, b)} f (abab) = {(1, a), (2, b), (3, a), (4, b)}, f (ab b) = {(1, a), (2, b), (5, b)}.

16/24 16 / 24 Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 17 The encoding of Arimura 2009 f is not surjective : Let X = {(1, a), (2, b)} and X = {((2, a), (3, b)}. The sequence ab corresponds to X while X is not an image of f. f is not monotonic : For the two patterns bb and abb. we have bb abb whereas f (bb) f (abb).

7/24 17 / 24 Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 18 Properties of the mapping Given a pattern P L S, then f (P) must contain a unique symbol in each index ; and f (P) must contains (1, x) for some symbol x Σ. The following sets characterize elements of P(R) which do not have an image by the coding f. F = {{(i, x), (i, y)} such that x, y Σ, i [1..n]} F + = {(i, x) such that x Σ, i [2..n]}

18/24 18 / 24 Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 19 Properties of the mapping Let Σ = {a, b} and n = 4. Then F + = {(2, a), (3, a), (4, a), (2, b), (3, b), (4, b)} F = {{(1, a), (1, b)}, {(2, a), (2, b)}, {(3, a), (3, b)}, {(4, a), (4, b)}} The decoding { function g is defined as follows : θ if X = f (θ) g(x) = otherwise (i.e. X F F + ) Let A P(R) such that A F and A F +. Then f (g(a)) = A.

19/24 19 / 24 Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 20 Results There is a bijection between L S and P(R) \ ( F F + ). f (L S ) is convex in 2R Bipartition is possible f preserves the incomparability of patterns but not comparability, i.e. if ϕ θ then f (ϕ) f (θ) bb abb but f (bb) = {(1, b), (2, b)} f (abb) = {(1, a), (2, b), (3, b)} The size of the borders may increase

20/24 20 / 24 Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 21 Results

21/24 21 / 24 Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 22 Results Theorem The extra elements can be partionned into two parts E + and E

Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 23 Results Lemma E + and E are computable in polynomial time in the size of the two borders. Main theorem The dualization problem of sequences can be polynomially reduced to hypergraph transversal problem. 22/24 22 / 24

23/24 23 / 24 Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 24 Concluding remarks Plan 1 Preliminaries 2 3 Concluding remarks

24/24 24 / 24 Nouvelles classes de problèmes pour la fouille de motifs intéressants dans les bases de données 25 Concluding remarks Concluding remarks New classes of pattern mining problems : RAS EWRAS WRAS Existence of incremental quasi-polynomial time algorithms for EWRAS SEQ belongs to EWRAS very useful to clarify existing pattern mining contributions