LABORATOIRE D'INFORMATIQUE DE NANTES-ATLANTIQUE (UMR 6241)
ÉCOLE DOCTORALE STIM, No. 503
«Sciences et technologies de l'information et des mathématiques»

Ph. D. Proposal for 2013
Functional monitoring problem for distributed large-scale data streams

Ph. D. Thesis Director
NAME, Surname: MOSTEFAOUI, Achour
Research Team: GDD
Laboratory: LINA (UMR 6241)
Affiliation: Université de Nantes, France
E-mail: Achour.Mostefaoui (at) univ-nantes.fr
Phone: (+33/0) 2 51 12 58 09
Advising Rate: 40 %
Current Ph. D. Students: 2

Ph. D. Adviser
NAME, Surname: BUSNEL, Yann
Research Team: GDD
Affiliation: Université de Nantes, France
E-mail: Yann.Busnel (at) univ-nantes.fr
Phone: (+33/0) 2 51 12 58 95
Advising Rate: 60 %
Current Ph. D. Students: 1

Possible Funding: MESR
Abstract. In this PhD proposal, we consider the setting of large-scale distributed systems, in which each node needs to quickly process a huge amount of data received in the form of a stream that may have been tampered with by an adversary. In this situation, several fundamental problems have been raised recently that concern many domains, including machine learning, data mining, databases, information retrieval, and network monitoring. In all these applications, it is necessary to quickly and precisely process a huge amount of data. We propose to combine sampling techniques and information-theoretic methods to extract pertinent information from such streams (metrics, summaries, pattern matching, etc.).

Unfortunately, computing information-theoretic measures in the data stream model is challenging, essentially because one needs to process a huge amount of data sequentially, on the fly, and using very little storage with respect to the size of the stream. In addition, the analysis must be robust over time in order to detect any sudden change in the observed streams (which may be the manifestation of a denial-of-service attack on routers or of worm propagation).

On the other hand, very few works have tackled the distributed streaming model, also called the functional monitoring problem [CMY08], which combines features of both the streaming model and communication complexity models. As in the streaming model, the input data is read on the fly and processed with minimal workspace and time. As in the communication complexity model, each node receives an input data stream, performs some local computation, and communicates only with a coordinator who wishes to continuously compute or estimate a given function of the union of all the input streams.
The challenging issue in this model is for the coordinator to compute the given function while minimizing the number of communicated bits [CMY08, ABC09, GT01].

Keywords. Large-scale Data Stream; Randomized approximation algorithm; Functional monitoring problem; Byzantine Adversary; Performance Analysis
Introduction

Context and issues

Estimating metrics and identifying specific patterns across several data streams is important in data-intensive applications. Many different domains are concerned by such analyses, including machine learning, data mining, databases, information retrieval, and network monitoring. In all these applications, it is necessary to quickly and precisely process a huge amount of data. For instance, in IP network management, the analysis of input streams makes it possible to rapidly detect anomalies or intrusions when changes in the communication patterns occur.

The problem of extracting pertinent information from a data stream is similar to the problem of identifying patterns that do not conform to the expected behavior, which has been an active area of research for many decades. Depending on the specificities of the domain and the type of outliers considered, different methods have been designed, namely classification-based, clustering-based, nearest-neighbor-based, statistical, spectral, and information-theoretic ones. A comprehensive survey of these techniques, with their advantages and drawbacks, is given in [CBK09]. A common feature of these techniques is their high space complexity and computational cost, as they rely on full-space algorithms for analyzing their data.

As a specific example, the main objective of [AB12a] is the online estimation of the similarity between observed data streams and expected (i.e., idealized) ones, in order to detect intrusions in network traffic in real time. More precisely, we have proposed a distributed algorithm that approximates, with guaranteed error bounds, in a single pass, and with both little storage memory and little processing capacity, the relative entropy between massive and high-frequency distributed sequences of data.
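To make the quantity concrete: for two discrete distributions p and q over the same items, the relative entropy (Kullback-Leibler divergence) is D(p||q) = sum_i p_i log(p_i / q_i). The minimal Python sketch below computes it exactly from full frequency tables, which is precisely the full-space computation that streaming estimators try to avoid. It is an illustration only, not the algorithm of [AB12a]; the function name and the toy streams are assumptions made for this example.

```python
import math
from collections import Counter


def kl_divergence(observed, expected):
    """Exact relative entropy D(p || q), in bits, between the empirical
    distributions of two finite streams (full-space baseline)."""
    p_counts, q_counts = Counter(observed), Counter(expected)
    m_p, m_q = sum(p_counts.values()), sum(q_counts.values())
    if m_p == 0 or m_q == 0:
        raise ValueError("both streams must contain at least one item")
    divergence = 0.0
    for item, count in p_counts.items():
        p = count / m_p
        q = q_counts[item] / m_q  # a Counter returns 0 for unseen items
        if q == 0.0:
            return math.inf  # p puts mass on an item that q never produces
        divergence += p * math.log2(p / q)
    return divergence


# Toy data (hypothetical): an observed stream against an idealized uniform one.
observed = ["a", "a", "a", "b"]
expected = ["a", "b", "a", "b"]
print(kl_divergence(observed, expected))
```

Both frequency tables grow with the number of distinct items, which is what makes this baseline unsuitable for massive streams.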
This work [AB12a] fits the IP network traffic context well; however, it could be applied to any other data issued from distributed applications, such as social networks or sensor readings.

Given our setting, in which storage and processing capacities are limited, real-time monitoring of network traffic cannot rely on full-space algorithms for analyzing input data. Instead, two main approaches exist to monitor massive data streams in real time. The first one consists in regularly sampling the input streams so that only a limited number of data items is kept locally [SKS+03, KGKM05, LCC05]. This makes it possible to compute functions exactly on these samples. However, the accuracy of this computation, with respect to the stream in its entirety, fully depends on the volume of data that has been sampled and on the location of the samples in the stream. Worse, an adversary may easily take advantage of the sampling policy to hide its attacks among packets that are not sampled, or in a way that prevents its malicious packets from being correlated.

In contrast, the streaming approach consists in scanning each piece of data of the input stream on the fly, and in locally keeping only compact synopses or sketches that contain the most important information about data items. This approach makes it possible to derive some data stream statistics with guaranteed error bounds without making any assumptions on the order in which
data items are received at nodes (i.e., the ordering of data items can be manipulated by an omnipotent adversary [ABG10]). Most of the research done so far with this approach has focused on computing functions or statistical measures with error ε using poly(1/ε, log n) space, where n is the domain size of the data items. These include the computation of the number of different data items in a given stream [BYJK+02, FM85, KNW10], the frequency moments [AMS96], the most frequent data items [AMS96, CCFC04], the entropy of the stream [CCM07, LSO+06], and the relative entropy between one data stream and the uniform one [AB12a, ABG12].

Problems and opportunities

Unfortunately, computing information-theoretic measures in the data stream model is challenging, essentially because one needs to process a huge amount of data sequentially, on the fly, and using very little storage with respect to the size of the stream. In addition, the analysis must be robust over time in order to detect any sudden change in the observed streams (which may be the manifestation of a denial-of-service attack on routers or of worm propagation).

On the other hand, very few works have tackled the distributed streaming model, also called the functional monitoring problem [CMY08], which combines features of both the streaming model and communication complexity models. As in the streaming model, the input data is read on the fly and processed with minimal workspace and time. As in the communication complexity model, each node receives an input data stream, performs some local computation, and communicates only with a coordinator who wishes to continuously compute or estimate a given function of the union of all the input streams. The challenging issue in this model is for the coordinator to compute the given function while minimizing the number of communicated bits [CMY08, ABC09, GT01]. Cormode et al.
[CMY08] pioneered the formal study of functions in this model by focusing on the estimation of the first three frequency moments F_0, F_1 and F_2 [AMS96]. Arackaparambil et al. [ABC09] consider empirical entropy estimation [AMS96] and improve on the work of Cormode et al. by providing lower bounds on the frequency moments. Finally, distributed algorithms for counting, at any time t, the number of items that have been received by a set of nodes since the inception of their streams have been proposed in [HYZ12, LRV12].

Following this model, we go a step further by proposing an estimator called AnKLe (Attack-tolerant enhanced Kullback-Leibler divergence Estimator) [ABG12] that estimates the relative entropy, or Kullback-Leibler (KL) divergence, between distributed streams. This divergence can be viewed as an extension of the Shannon entropy and is often referred to as the relative entropy [CT91]. Citing Chakrabarti et al. [CBM06], [...] rationale of estimating entropy-based distances is that there are intimate connections between the randomness of traffic sequences (formalized as the entropy) and the propagation of malicious events. Indeed, detecting sudden changes in a stream may be a good indicator of attacks. Note that in [GIM08], the authors propose a characterization of
the information divergences that are not sketchable. They have proven that any distance that does not have norm-like properties is not sketchable.

Requested work

Objectives

Most of the work proposed for a single stream is clearly not adaptable to the distributed functional monitoring model [CMY08]. The concrete objective of this PhD proposal is the design and prototype implementation of efficient one-pass distributed algorithms in the context of social-based personal cloud networks, where massive data stream exchanges are the norm.

Specifically, the first objective of this thesis is to propose an enhanced metric that reflects the relationships between any set of discrete probability distributions in the context of massive data streams, in order to model any user behavior and to provide a relationship or proximity metric between users. This metric should be able to efficiently estimate a broad class of distance measures between large data streams by computing these distances using only compact synopses or sketches of the streams. It should be distribution-free and should make no assumption about the underlying data volume.

The second step is to propose a one-pass distributed algorithm that approximates this novel metric with a given probability δ. This algorithm should use very little space and few operations (i.e., sublinear in the parameters of the system: size of the stream, number of distinct items in the streams). Implementation and comparison with previous contributions from the literature are a mandatory step to validate the relevance of the solution.

Finally, the existing literature on the distributed functional monitoring model does not take into account the semantics and/or the type of the data that compose the streams. This approach provides very generic solutions, which may not be optimal in specific application contexts [AB12a].
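As an illustration of the distributed functional monitoring model referred to in these objectives, here is a minimal Python sketch of a classical threshold-based scheme for continuously tracking the total number of items (F_1) received by a set of nodes. Each node reports to the coordinator only when its local count has grown by more than a factor (1 + ε) since its last report, so the coordinator's estimate n' satisfies n' <= n <= (1 + ε) n' while exchanging far fewer messages than there are items. This is a simplified illustration, not the algorithm of [CMY08]; the class names, the value of ε and the synthetic streams are assumptions.

```python
class Node:
    """A site that observes a local stream and talks only to the coordinator."""

    def __init__(self, eps):
        self.eps = eps
        self.count = 0     # exact local count
        self.reported = 0  # last value sent to the coordinator

    def receive(self, _item):
        """Process one item; return the new count if a report is due, else None."""
        self.count += 1
        if self.count > (1 + self.eps) * self.reported:
            self.reported = self.count
            return self.count
        return None


class Coordinator:
    """Keeps the last reported count of every node and sums them on demand."""

    def __init__(self, num_nodes):
        self.last = [0] * num_nodes
        self.messages = 0

    def on_report(self, node_id, value):
        self.last[node_id] = value
        self.messages += 1

    def estimate(self):
        return sum(self.last)


def simulate(streams, eps):
    """Feed each stream to its node; return (estimate, messages, true total).
    The arrival order across nodes does not affect the error guarantee."""
    nodes = [Node(eps) for _ in streams]
    coordinator = Coordinator(len(streams))
    total = 0
    for node_id, stream in enumerate(streams):
        for item in stream:
            total += 1
            report = nodes[node_id].receive(item)
            if report is not None:
                coordinator.on_report(node_id, report)
    return coordinator.estimate(), coordinator.messages, total


# Hypothetical workload: three nodes receiving 1000 items each.
estimate, messages, total = simulate([range(1000)] * 3, eps=0.25)
print(estimate, messages, total)
```

The number of messages per node grows only logarithmically with its local stream length, which is the kind of communication saving the model aims for.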
Since the system architecture and applications are clearly delimited for this PhD, we also plan to restrict the system model and to propose enhanced algorithms that take into account the inherent syntax or content of these data streams.

Working plan

The work will start with a study of the state of the art, which spans several topics (data stream model, distributed functional monitoring problem, social networks, etc.). Quite early, experiments should be carried out to compare existing solutions on different extended data sets, in order to identify which approaches are best suited to our context (for instance, sampling approaches as in [AMS96, BYJK+02] as opposed to sketching ones [AB12b, SKS+03]). Moreover, all experiments should be carried out on the testbed owned by the GDD team, a cluster of 48 Raspberry Pi boards.
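As a small illustration of the kind of experiment envisaged here, the following Python sketch contrasts an exact full-space computation of the second frequency moment F_2 with a small-space randomized "tug-of-war" estimator from the frequency-moments literature [AMS96]. It is a simplified sketch under stated assumptions: signs are derived from a salted BLAKE2b hash rather than from a 4-wise independent family, the median-of-means step is omitted, and the class name, the number of counters and the synthetic stream are hypothetical.

```python
import hashlib
from collections import Counter


def exact_f2(stream):
    """Full-space baseline: F_2 is the sum of squared item frequencies."""
    return sum(c * c for c in Counter(stream).values())


class F2Sketch:
    """Keeps num_counters signed counters instead of one counter per item."""

    def __init__(self, num_counters=64):
        self.z = [0] * num_counters

    @staticmethod
    def _sign(item, j):
        # Pseudo-random +1/-1 for (item, counter j); the hash is an assumption,
        # standing in for a 4-wise independent hash family.
        salt = j.to_bytes(8, "little")
        digest = hashlib.blake2b(repr(item).encode(), salt=salt, digest_size=1).digest()
        return 1 if digest[0] & 1 else -1

    def update(self, item):
        """One-pass update: add the item's sign to every counter."""
        for j in range(len(self.z)):
            self.z[j] += self._sign(item, j)

    def estimate(self):
        """Average of squared counters; each square has expectation F_2
        when the signs behave as independent random bits."""
        return sum(z * z for z in self.z) / len(self.z)


# Hypothetical stream: 50 distinct items, 100 occurrences each.
stream = [i % 50 for i in range(5000)]
sketch = F2Sketch(num_counters=64)
for item in stream:
    sketch.update(item)
print(exact_f2(stream), sketch.estimate())
```

The exact baseline stores one counter per distinct item, whereas the sketch's memory is fixed in advance; comparing both outputs over several data sets gives a first feel for the accuracy/space trade-off.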
This first phase should lead to a first algorithm design and to a concrete target application. It will then be necessary to incrementally refine both the design and the implementation. In addition to the experimental validation of these new solutions, all proposed algorithms will be proved theoretically in the aforementioned model. This will make it possible to derive lower and upper bounds in terms of space and time complexity, and to estimate precisely the approximation error due to the probabilistic approach of this model.

Applicants

Skills

Applicants must have strong algorithmic skills and should be familiar with applied mathematics, such as probability and statistics. A good command of distributed systems would be appreciated.

Applicants' names and academic results (if any)

We have an applicant who may do her M2 internship in our team in the next six months and who is interested in the subject:

Ms. NGUYEN CHI HOANG (Vietnamese)
Master's degree in Computer Science, specialization in Networks and Communicating Systems, IFI (Institut de la Francophonie pour l'Informatique), Vietnam, with a double degree from Université Lyon 1, France
Second prize in the national mathematics competition for Vietnamese students, Vinh, Nghe An, Vietnam
FE certification (standard of skills for IT engineers), awarded by JITEC (Japan) and VITEC (Vietnam)

However, we remain open to any other excellent candidate for this PhD proposal.
Bibliography

[AB12a] E. Anceaume and Y. Busnel. An information divergence estimation over data streams. In Proceedings of the 11th IEEE International Symposium on Network Computing and Applications (NCA), 2012.

[AB12b] E. Anceaume and Y. Busnel. Sketch *-metric: Comparing Data Streams via Sketching. Technical report, CIDER - IRISA, CIDRE - INRIA - SUPELEC, Laboratoire d'Informatique de Nantes Atlantique - LINA, July 2012. 12 pages, double column.

[ABC09] C. Arackaparambil, J. Brody, and A. Chakrabarti. Functional monitoring without monotonicity. In Proceedings of the 36th ACM International Colloquium on Automata, Languages and Programming: Part 1, 2009.

[ABG10] E. Anceaume, Y. Busnel, and S. Gambs. Uniform and Ergodic Sampling in Unstructured Peer-to-Peer Systems with Malicious Nodes. In Proceedings of the 14th International Conference on Principles of Distributed Systems (OPODIS), volume 6490, pages 64-78, 2010.

[ABG12] E. Anceaume, Y. Busnel, and S. Gambs. AnKLe: Detecting attacks in large scale systems via information divergence. In Proceedings of the 9th European Dependable Computing Conference (EDCC), 2012.

[AMS96] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In Proceedings of the 28th Annual ACM Symposium on Theory of Computing (STOC), pages 20-29, 1996.

[BYJK+02] Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In Proceedings of the 6th International Workshop on Randomization and Approximation Techniques (RANDOM), pages 1-10. Springer-Verlag, 2002.

[CBK09] V. Chandola, A. Banerjee, and V. Kumar. Anomaly detection: A survey. ACM Computing Surveys, 41(3):1-58, 2009.

[CBM06] A. Chakrabarti, K. Do Ba, and S. Muthukrishnan. Estimating entropy and entropy norm on data streams. In Proceedings of the 23rd International Symposium on Theoretical Aspects of Computer Science (STACS). Springer, 2006.

[CCFC04] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. Theoretical Computer Science, 312(1):3-15, 2004.

[CCM07] A. Chakrabarti, G. Cormode, and A. McGregor. A near-optimal algorithm for computing the entropy of a stream. In ACM-SIAM Symposium on Discrete Algorithms, pages 328-335, 2007.

[CMY08] G. Cormode, S. Muthukrishnan, and K. Yi. Algorithms for distributed functional monitoring. In Proceedings of the 19th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2008.

[CT91] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, New York, 1991.

[FM85] P. Flajolet and G. Nigel Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182-209, 1985.

[GIM08] S. Guha, P. Indyk, and A. McGregor. Sketching information divergences. Machine Learning, 72(1-2):5-19, 2008.

[GT01] P. B. Gibbons and S. Tirthapura. Estimating simple functions on the union of data streams. In Proceedings of the Thirteenth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 281-291, 2001.

[HYZ12] Z. Huang, K. Yi, and Q. Zhang. Randomized algorithms for tracking distributed count, frequencies and ranks. In Proceedings of the 31st ACM Symposium on Principles of Database Systems (PODS), 2012.

[KGKM05] V. Karamcheti, D. Geiger, Z. Kedem, and S. Muthukrishnan. Detecting malicious network traffic using inverse distribution of packet contents. In Proceedings of the Workshop on Mining Network Data (MineNet), co-located with ACM SIGCOMM, 2005.

[KNW10] D. M. Kane, J. Nelson, and D. P. Woodruff. An optimal algorithm for the distinct element problem. In Proceedings of the Symposium on Principles of Databases (PODS), 2010.

[LCC05] A. Lakhina, M. Crovella, and C. Diot. Mining anomalies using traffic feature distributions. In Proceedings of the ACM SIGCOMM, 2005.

[LRV12] Z. Liu, B. Radunovic, and M. Vojnovic. Continuous distributed counting for non-monotonic streams. In Proceedings of the 31st ACM Symposium on Principles of Database Systems (PODS), 2012.

[LSO+06] A. Lall, V. Sekar, M. Ogihara, J. Xu, and H. Zhang. Data streaming algorithms for estimating entropy of network traffic. In Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS). ACM, 2006.

[SKS+03] B. Krishnamurthy, S. Sen, Y. Zhang, and Y. Chen. Sketch-based change detection: Methods, evaluation, and applications. In Internet Measurement Conference, pages 234-247, 2003.
CV of the thesis director: Achour MOSTEFAOUI

Current professional situation

I was an Associate Professor (Maître de conférences) at the Université de Rennes (UFR ISTIC) and a member of the IRISA/INRIA laboratory until October 2011, when I joined the Université de Nantes as a Full Professor (Professeur des Universités) in the computer science department of the UFR Sciences et Techniques.

Research topics

My research mainly focuses on problems of synchronization, distributed data management, and fault tolerance in distributed systems. In recent years, I have also become interested in modeling and in defining abstractions that take into account the evolution of distributed systems (dynamicity, mobility).

Ph. D. supervision

Tyler Crane. I supervise part of the Ph. D. work of Tyler Crane, who completed his Master's degree at the University of Santa Barbara under the supervision of Prof. Amr El Abbadi. He is funded by a European Marie Curie fellowship. Tyler works on data management in multi-core machines following the transactional model. This model, formerly reserved for databases, has turned out to be the most suitable one for transparent data sharing between the cores of a single machine.

Hamouma Moumen. I supervise the Ph. D. work of Hamouma Moumen (co-supervised thesis, University of Bougie, Algeria). In his thesis, Moumen explores different ways of electing leaders (even local ones) in open systems. The goal is to take advantage of any property, temporal [16] or not [11], that appears in a system, without knowing a priori where the property will appear.

Gilles Trédan. I supervised the Ph. D. work of Gilles Trédan (ministry grant). Gilles explored the emergence, creation and maintenance of structures in distributed systems in order to solve basic problems of distributed systems, such as Byzantine consensus [16,18], the assignment of virtual coordinates [15] and the securing of communications [10] in sensor networks, and the estimation of the number of live processes in an asynchronous system [2].

Corentin Travers. I co-supervised the Ph. D. work of Corentin Travers (ministry grant). In his thesis, Corentin Travers addressed the implementation of failure detectors and the link between the power of a failure detector and the difficulty of a problem. As a first step, he proposed several protocols that implement a leader by combining temporal and non-temporal properties in the same protocol. He also studied the relative difficulty of certain agreement problems in order to identify the foundations of distributed computing [1,3,7,8,19].

Master's supervision

Since 2007, I have supervised four research Master's students.
François Joulaud (2007-2010): joined an IT services company (SS2I) in the Rennes area.
Lilia Ziane Khodja (2008-2011): currently a Ph. D. student at the Université de Belfort.
Olivier Baldellon (2009-2012): currently a Ph. D. student at LAAS/CNRS in Toulouse.
Julien Stainer (2010-ongoing): currently a Ph. D. student at the Université de Rennes 1.

Program committees of international conferences and journals

I have served on program committees of international conferences; I was co-chair of an OTM workshop in 2008 and local chair of a Europar track in 2007. I am a member of the editorial board of the International Journal on Communication and Networks, and I have reviewed articles for various journals.

Invitations

Over the last four years, I have been invited on several occasions and have myself invited colleagues. Among others, I was invited to give a seminar at the University of Newcastle (UK) in 2007, at IBM Bangalore (India) in 2009, at Loria (Nancy) in 2010 and at LIP6 (Paris) in 2008. I also invited Marc Shapiro, senior researcher (DR) at INRIA Rocquencourt, in 2010, Professor Raimundo Macedo from the Federal University of Bahia (Brazil) in 2011, and Professor Dariusz Kowalski from the University of Liverpool (UK) in 2011.

Publications

8 articles in international journals with editorial boards; 13 articles in the proceedings of international conferences with selection committees.
CV of the co-adviser: Yann BUSNEL

Current professional situation

After a Ph. D. defended at the Université de Rennes 1 and a one-year post-doc at the University of Rome, La Sapienza, I have been an Associate Professor (Maître de Conférences) since September 2009 at the Université de Nantes, in the computer science department of the UFR Sciences et Techniques, and a member of the Laboratoire d'Informatique de Nantes Atlantique (UMR 6241).

Research topics

Keywords: Distributed systems, Self-organization, Large-scale systems, Massive data streams, Peer-to-peer paradigm, Epidemic protocols, Population protocols, Wireless sensor networks, Social networks, Stochastic and theoretical analysis, Simulation and practical evaluation.

Supervision

Since October 2010, I have been co-supervising Nagham Alhadad, a CNRS-funded Ph. D. student, with Patricia Serrano-Alvarado and Philippe Lamarre. Her research topic concerns user-centric community systems.

In addition, during my post-doctoral stay at the University of Rome, I actively collaborated with two Ph. D. students supervised by Roberto Baldoni. In this context, I guided and supported them in their research and in the experimental validations they carried out:

Silvia Bonomi. In collaboration with Ravi Prakash (University of Dallas, Texas, USA), we studied the possibility of extending the lifetime and the reliability of a wireless sensor network by designing a duty cycle between strongly connected logical layers [11],[27].

Marco Platania. In collaboration with Giorgia Lodi and Leonardo Querzoni, we studied the decentralization of online social networks such as Facebook. In this context, we sought to implement such networks in a fully distributed setting.

Supervision of Master's internships

During this stay, I co-supervised two students of the Master's program in Computer and Systems Engineering (MSc level).

Michele Dominici carried out, under my co-supervision (with Roberto Baldoni, Giorgia Lodi and Leonardo Querzoni), a Master's thesis on the real-world deployment of a sensor network in the context of home automation. This network had to meet the characteristics of so-called smart networks. I was in charge of supervising and regularly following up the student, as well as of his technical and methodological training.

Luca Doria carried out, under my co-supervision (with Roberto Baldoni, Giorgia Lodi and Leonardo Querzoni), an internship on the creation of a multi-layer network providing a set of identified properties and services, in the context of the European project CoMiFiN. I was in charge of supervising and regularly following up the student, as well as of his technical and methodological training.

Review committees

I have been a member of the program committees of international conferences (NCA 2013, PDP 2012, ...) and I have served as an external reviewer for international journals (Elsevier Computer Networks). I also chaired the program committees of two international workshops (DYNAM 2011 and TADDS 2012).

Responsibilities

I was in charge of international relations for the computer science department from 2010 to 2012, and I have been a member of the department council since 2011.

Complete list of publications

To date, my personal bibliography comprises 32 articles (excluding posters and research reports), including 3 books (2 workshop proceedings and one publication of my doctoral work), 5 journal articles (3 of them in international journals ranked A and A*), and 24 articles in peer-reviewed conferences (15 of them international), plus various peer-reviewed poster sessions and research reports. The complete list is available at: http://pagesperso.lina.univ-nantes.fr/~busnel-y/yann_busnels_home_page/complete_bibliography.html