Wi-Fi Activity in Open Environments: Tools, Measurements, and Analyses


Doctoral thesis
UNIVERSITÉ PIERRE ET MARIE CURIE
Doctoral school: Computer Science, Telecommunications, and Electronics

presented by Thomas Claveirole
to obtain the degree of Doctor of the Université Pierre et Marie Curie

Wi-Fi Activity in Open Environments: Tools, Measurements, and Analyses

To be defended on 26 February 2010 before a jury composed of:

Ana Cavalli, Reviewer, Prof., TELECOM & Management SudParis
Thierry Turletti, Reviewer, INRIA researcher
Khaldoun Al Agha, Examiner, Prof., Université Paris-Sud 11
Guillaume Chelius, Examiner, INRIA researcher
Marcelo Dias de Amorim, Co-advisor, CNRS researcher
Serge Fdida, Co-advisor, Prof., Université Pierre et Marie Curie

Library number:


PhD Thesis
UNIVERSITY PIERRE AND MARIE CURIE
Doctoral school: Computer Science, Telecommunications, and Electronics

presented by Thomas Claveirole
submitted for the degree of Doctor of Science of the University Pierre and Marie Curie

Wi-Fi Activity in Open Environments: Tools, Measurements, and Analyses

Committee in charge:

Ana Cavalli, Reviewer, Prof., TELECOM & Management SudParis
Thierry Turletti, Reviewer, INRIA researcher
Khaldoun Al Agha, Examiner, Prof., University of Paris-Sud 11
Guillaume Chelius, Examiner, INRIA researcher
Marcelo Dias de Amorim, Co-advisor, CNRS researcher
Serge Fdida, Co-advisor, Prof., University Pierre and Marie Curie


Acknowledgements

I would first like to thank my reviewers and my jury. Established researchers who devote time to me despite busy schedules, when nothing obliges them to: this has always seemed a little odd to me. I can therefore only express my gratitude. Thank you to Ana Cavalli, Thierry Turletti, and Guillaume Chelius. I take the liberty of thanking Khaldoun Al Agha separately, and especially, because he is the one who threw me into networking, five years ago.

Likewise, it is impossible to forget Serge Fdida and Marcelo Dias de Amorim. Thank you for the supervision, for the welcome, for everything you have given me. Marcelo, in particular, I have something to tell you. Several times I came to see you with my morale at rock bottom. I am at a dead end, nothing is going to work as planned, that much is certain. And every time, you talk, and there I am, pumped up, motivated as never before. I do not know how you do it. I have not worked out your secret. But I hope to only ever work with people as motivating as you.

I also thank Mathias Boc. Without our collaboration, this thesis would be a little less complete. And, let us admit it, I owe him half of a trip to San Francisco...

There are also some rather special people whom I absolutely want to thank: my friends from my former school, ÉPITA. Among others, the research and development laboratory (Akim, Théo, Raph., and so many others). But also Claire, Max., Alexandre, and all the (former) student assistants. Thank you. Long before my thesis I shared a lot with you. Not only work, but a lot of work all the same. And it changed the way I approach computer science. At the risk of sounding mad, I will say it: I think your personalities seep through my research, my lines of code, my articles.

But I have many other friends to thank. It took me time to fully measure what my thesis owes them. It also took a misunderstanding on the subject that set me against one of them. By mistake, almost by chance, at the very moment I am writing these lines. So thank you Yosra, Brice, Salim, Amélie, Mathias, Cédric. Thank you P.-E., Thomas, Clémence, Matthieu, Anneli. Thank you Magali, thank you Sophia, thank you Élodie. Thank you Pierre. That makes a lot of thanks, and a lot of people, it is true. And yet I am sure I am forgetting some. Plenty. And yet they all took part in my thesis indirectly, not always consciously. Thank you.

Finally, I end with a thought for my family, and in particular for my grandparents: they have already told me how proud they will be to have a grandson who is a doctor.


Résumé

For about ten years, Wi-Fi has enjoyed enormous success. As a consequence, a significant part of networking research consists in measuring its underlying protocol, IEEE 802.11, in order to understand it better. Sniffing is one of the techniques useful for this understanding. It consists in deploying monitors within a measurement area, which record all the traffic they can hear. It is a passive technique, and each monitor produces packet traces. It is also a fundamental step for a number of operations, including the diagnosis of network problems, security improvement, and the analysis of protocol behavior.

Existing work based on sniffing raises a number of questions. Although this technique essentially relies on manipulating IEEE 802.11 packet traces, there is no generic software toolbox to perform these manipulations. As a consequence, efforts are duplicated, some tools are too specific, interoperability is sometimes poor, and performance is not always satisfactory. This is particularly true for trace merging. Although it is a step common to several studies, few tools exist, and their usability is limited.

Beyond these prosaic problems there are also higher-level questions. First, there is uncertainty about the accuracy one can expect from monitors. Second, most studies focus on the low-level characteristics of IEEE 802.11. Since this protocol is now present on new categories of devices, notably mobile devices, it would also be interesting to study usage habits rather than protocol problems. Finally, most experiments focus on academic environments (universities, laboratories, conferences). It is reasonable to expect that other environments exhibit different characteristics.

In this thesis, we propose WiPal, a software suite for processing IEEE 802.11 packet traces, and we use it to address the problems described above. WiPal includes a generic library for manipulating packet traces and IEEE 802.11 frames. It also provides a set of programs on top of this library. These programs perform various operations (for example concatenation or comparison), extract statistics, anonymize traces, and merge traces. To make WiPal generic and efficient, we developed several specific algorithms, as well as optimizations to process large datasets efficiently.

Using WiPal, we carry out several analyses in different environments. By analyzing two short-duration datasets, we improve our understanding of sniffing accuracy. Then, by analyzing three long-duration datasets (several days) in different environments, we obtain a better understanding of users' daily behaviors with respect to wireless networks. These environments have different social characteristics: an office space, a suburban housing area, and a dense urban residential area. Our results reveal new and unexpected properties. For example, we show that the usual techniques for measuring trace accuracy are not as reliable as expected, and that long-duration traces contain a very small proportion of regular users.

Keywords: measurement, Wi-Fi, IEEE 802.11, sniffing, packet trace, trace merging.

Abstract

For about a dozen years Wi-Fi has enjoyed tremendous success. Consequently, a large part of networking research has focused on measuring and understanding its corresponding protocol, IEEE 802.11. Among the techniques that have proved useful to this research is wireless sniffing. This passive measurement technique consists in spreading within some target area a number of monitors that capture all the wireless traffic they hear to produce packet traces. It is a fundamental step in a number of network operations, including network diagnosis, security enhancement, and behavioral analysis of protocols.

Existing work based on wireless sniffing, however, raises a number of issues. Although IEEE 802.11 packet trace manipulations are fundamental to this technique, no generic framework exists to carry them out. This results in duplicated efforts among scientists, overspecialized tools, poor interoperability, and sometimes sub-optimal performance. This is especially true for trace merging. Although it is a common step in many studies, only a few tools exist to merge Wi-Fi traces, and they have limited usability.

Beyond these prosaic problems there are also more challenging questions. First, there is a lack of insight into the accuracy one can expect from wireless sniffers. Second, most studies focus on low-level characteristics of IEEE 802.11. As Wi-Fi now equips new categories of mobile devices, studying usage patterns instead of protocol issues also becomes interesting. Finally, most experiments collect traces in academic environments (university campuses, laboratories, or conference venues). It is likely that other environments would display different properties.

In this thesis we propose WiPal, a framework to process IEEE 802.11 packet traces, and use it to tackle the aforementioned issues. WiPal includes a generic library to handle packet traces and IEEE 802.11 frames. It also provides a set of programs atop this library. These programs feature miscellaneous operations such as concatenation or comparison, statistics extraction, trace anonymization, and, most notably, trace merging. We developed a number of specific algorithms and optimizations in order to make WiPal a generic tool able to cope efficiently with large datasets.

Using WiPal, we perform a number of analyses on traces we collected in various environments. Through the analysis of two short-lived datasets using up to eight monitors, we extend our understanding of the accuracy of Wi-Fi traces. Then, through the analysis of three long-lived datasets (several days), we obtain a better understanding of people's daily behaviors with respect to the underlying wireless network. These environments have different sociological characteristics: an office area, a sparse residential area, and a dense residential area. Our results reveal previously unseen and unexpected properties. For instance, traditional techniques to estimate trace accuracy are much less reliable than previously thought, and regular users account for a very small portion of the total population in long-lived traces.

Keywords: measurement, Wi-Fi, IEEE 802.11, sniffing, packet trace, trace merging.

Contents

Acknowledgements
Résumé
Abstract
Contents

1 Introduction
    Context: Wi-Fi measurements
    Issues with Wi-Fi sniffing and related techniques
    Contributions of this thesis
    WiPal: manipulating IEEE 802.11 traces
    Applying WiPal: empirical analyses
    Outline

I The WiPal trace manipulation framework

2 WiPal: overview and design
    Trace manipulation tools: related work
    Overview of WiPal
    Features
    Overall architecture
    Packet parsing
    PHY headers
    IEEE 802.11 parsing
    Filters
    Filter sources: pcap abstractions
    Processing filters
    Performance evaluation
    Methodology
    Results
    Conclusion

3 WiPal: IEEE 802.11 trace merging
    Trace merging: state of the art
    WiPal's basics
    Detailed operation of WiPal's trace merging
    Identifying reference frames
    Extraction of unique frames
    Intersection
    Synchronization
    Merging
    Evaluation
    Correctness
    Efficiency
    Conclusion

II Applying WiPal: empirical analyses

4 Accuracy of wireless packet sniffing
    Completeness evaluation: state of the art
    Datasets
    Overview
    Preliminary analysis
    Completeness evaluation: shortcomings
    Completeness and number of sniffers
    Methodology
    Results
    Conclusion

5 Empirical analysis of Wi-Fi activity in three urban scenarios
    Setup
    Device diversity
    Cumulated activity durations
    Growth of the number of devices
    Activity/Mobility Behaviors
    Inter-activity patterns
    Predominant activity pattern
    Conclusion

6 Conclusion and future work
    WiPal
    Wi-Fi sniffing accuracy
    Wi-Fi activity
    Perspectives

Appendices

A Summary of the thesis in French
    A.1 Context
        A.1.1 Passive Wi-Fi measurements and sniffing
        A.1.2 Open questions
    A.2 Contributions of this thesis
        A.2.1 WiPal: manipulating IEEE 802.11 traces
        A.2.2 Applications of WiPal: empirical analyses
    A.3 Conclusion

B WiPal manual
    B.1 The programs
        B.1.1 Invocation
        B.1.2 Concatenation (and Prism noise filtering)
        B.1.3 Comparisons
        B.1.4 Sub-traces
        B.1.5 Merging
        B.1.6 Synchronization
        B.1.7 Unique frames
        B.1.8 Duplicate data frames
        B.1.9 Statistics
        B.1.10 Anonymization
        B.1.11 Miscellaneous programs
        B.1.12 Undocumented programs
    B.2 The library
    B.3 FAQ

C List of publications
    C.1 Journals
    C.2 Conferences
    C.3 Demos and posters
    C.4 Software
    C.5 Under review

Bibliography
List of Figures
List of Tables
Listings


Chapter 1. Introduction

The IEEE 802.11 standard [37] defines base layers for wireless communications. It appeared about a dozen years ago, using the trademark Wi-Fi, and is widely used today. Personal computers featuring IP communications over wireless links rely almost exclusively on this protocol. Furthermore, Wi-Fi also plays a major role on other wireless-capable mobile devices: it is available on most PDAs, smartphones, portable music players, and even some digital cameras. As a consequence, Wi-Fi is part of the landscape of ubiquitous computing [58]. Along with other wireless protocols, such as Bluetooth or GSM, it is involved in creating a transparent digital environment for everyday life. For instance, Wi-Fi access points provide Internet access (hotspots) in households, hotels, conferences, and many other places.

Understanding how IEEE 802.11 implementations behave in the wild, and what the usage patterns of their users are, is therefore essential. This insight is necessary for developing new applications and protocols, or improving existing ones.

1.1 Context: Wi-Fi measurements

IEEE 802.11 specifies a physical layer (PHY) and a medium access control scheme (MAC) for wireless networks. The PHY is in charge of encoding and decoding digital information (bit sequences) to and from radio wave signals. The MAC, on the other hand, schedules transmissions so that devices can share the medium without interfering with each other.

Although IEEE 802.11 is mostly an industry-pushed standard, computer scientists have produced a large body of research on it. This includes specialized topics such as studying its PHY [30;45], its MAC [46], and other features such as security [12;14]. More generic research topics also involve this protocol: ad hoc and mesh networking [10;27], sensor networks [60], or pervasive computing [58]. A proper understanding of IEEE 802.11 can therefore benefit all these topics. To achieve this understanding one needs both theoretical analyses
and practical experiments. This thesis focuses on experiments, and more specifically on measurements of wireless networks in the wild.

Every network measurement is either active or passive. Active measurements alter network traffic so they can evaluate various parameters. Classic active techniques include saturating a link to measure maximum throughput, or sending probes back and forth to evaluate round-trip delays. Conversely, passive measurements do not interfere with network traffic. This is the case, for instance, when tapping a network link to analyze packet flows. Note that passive techniques might still interfere with the infrastructure: they might require users to install specific software, or administrators to plug in specific tapping equipment.

A common passive technique to measure wireless networks is sniffing. Wireless sniffing consists in spreading within some target area a number of monitors (or sniffers) that capture all the wireless traffic they hear (see Figure 1.1). Sniffers produce traces composed of a succession of MAC frames. Wireless sniffing is a fundamental step in a number of network operations, including network diagnosis [23;34], security enhancement [12;48], and behavioral analysis of protocols [22;39;43;59]. Although not mandatory, it is also possible to use wireless sniffing to support some location systems [20;21;61]. It comes in a variety of flavors: there might be one or several sniffers, sniffers may be commodity hardware or specialized devices, and they can operate offline or be part of a wired infrastructure (among other parameters). In any case, the sniffing operation is passive and does not interfere with the network's normal operation.

Figure 1.1: Wireless sniffing: passive monitors listen to the wireless activity inside the measurement area.

Wireless sniffing often involves a centralized process that is responsible for merging the traces [22;43;59]. The objective is to obtain a global view of the wireless activity from multiple local measurements. By providing overlapping coverage zones, it is also possible to compensate for frame losses with data from different sniffers. Merging is however a difficult
task; it requires precise synchronization among traces (up to a few microseconds) and coping with the unreliable nature of the medium (frame loss is unavoidable).

1.2 Issues with Wi-Fi sniffing and related techniques

There are still, however, a number of issues with Wi-Fi sniffing. This thesis focuses on technical issues (some non-technical issues also exist, for instance legal and ethical issues [11;55]). We categorize them in two classes: issues with the technique itself and issues with the tools. This thesis addresses both, in an effort to collect new datasets and produce original analyses.

Issues with the technique relate to the relevance of the produced traces. This includes sniffer accuracy. Even in good radio conditions, sniffers may miss successfully transmitted frames. In this context, a natural question arises: since each sniffer trace is incomplete (i.e., lacks some frames), a merged trace is likely to be incomplete as well. What accuracy can one expect from a single sniffer? From multiple sniffers? What results can be drawn from incomplete traces?

Another issue regarding the relevance of traces concerns the available datasets. Although Wi-Fi is almost ubiquitous, most of the datasets made available by the research community come from university campuses, laboratories, or conference venues [2]. This is partly due to current practices focusing on environments that are easy for researchers to access, but also to the fact that existing monitoring techniques only fit specific scenarios. Most of the techniques available in the literature either focus on one single network, require setting up a whole infrastructure, or need intrusive access to one's network. Once down the street or inside an individual's house, such techniques are therefore difficult to implement. Wireless sniffing, however, has a strong potential for monitoring all kinds of environments: it is passive, it does not interfere with the infrastructure of monitored networks, and in some cases it does not even need to rely on any infrastructure at all. But this potential has remained unexploited so far. As a consequence, most researchers restrict sniffing to studying protocol quirks [22;39;43]. We think however that sniffing could be a great tool for studying wireless network usage in environments that are usually hard to reach (e.g., individual houses, streets, or parks).

Issues with tools mostly relate to the manipulation of packet traces. Many network operations involve these traces: administrators use them for monitoring or troubleshooting, researchers use them for measurements, simulations, or validations. Wireless sniffers produce packet traces consisting of lists of MAC frames. Many tools exist for their creation and manipulation, but most of these are designed for a very specific goal, and carry their own packet-processing code. For instance, tcpdump [8] understands many network protocols, but its parsing code cannot be used for any purpose other than displaying packets on a terminal. As another example, Wireshark [6] is more generic, but it is still mostly visualization-oriented
and suffers from similar issues. Most packet processing programs have a good design and are very efficient regarding their focus, but each time one creates another tool to handle packet traces, it is impractical to rely on previous code. Furthermore, some tools suffer from performance issues (for instance, Scapy [5] is a powerful tool to analyze packet traces but is not tractable on large traces of 1 GB or more). All of this makes carrying out custom analyses on sniffer traces a tedious process. It often requires developing new tools from scratch.

For the same reasons, merging IEEE 802.11 traces is also an issue. The literature has provided the community with a few merging tools, but most of them require a wired infrastructure [22;28]. The others are too specific to the experiments conducted in the corresponding papers [43;44]. In order to make Wi-Fi sniffing generalizable to any environment, one needs tools that are both generic and independent of any wired infrastructure.

1.3 Contributions of this thesis

The contributions of this thesis are twofold. First, we develop a framework, called WiPal, to help process IEEE 802.11 packet traces. This framework includes a generic library to help develop new tools, and several hands-on utilities to perform predefined operations on trace files. These utilities include a trace merger with innovative features. Second, we perform two analyses using these tools. In order to carry out these analyses we collect several datasets in various environments, including day-long traces from a residential area and an uptown location. The first analysis focuses on the accuracy of Wi-Fi sniffing, while the second studies Wi-Fi usage patterns.

1.3.1 WiPal: manipulating IEEE 802.11 traces

The first part of this thesis presents WiPal, our packet trace manipulation framework. WiPal was designed for performance, without any specific application in mind, but rather in the hope that others could rely on it to develop custom trace processing software. Though it focuses on handling the IEEE 802.11 protocol, it provides several protocol-agnostic features. What makes WiPal interesting is its original design and some of its novel features. In this thesis:

- We present generic patterns for handling various types of packet traces. For instance, using a pipe and filter mechanism to process packet traces, or using a static callback mechanism to generate frame parsers that are both efficient and generic.
- We present how some novel features might benefit packet processing programs, and how to implement them. For instance, random access to a packet trace, or the ability to consider the aggregation of several files as one unique packet stream.
- We raise a number of issues a program designer might encounter when writing packet trace processing software. We discuss existing practices to solve them and the specific solutions adopted by WiPal (how and why).
- We evaluate the performance of WiPal and compare it with other tools. The results show that WiPal's generic design does not impact its execution speed: it can compete with specialized code. Also, some new features do not impact performance, and others, which are optional, only imply a limited overhead.

A distinctive component of WiPal is its merging tool. This tool works offline and is able to merge IEEE 802.11 packet traces. Its key features are performance, ease of use, and flexibility. As a consequence, its design does not assume features from traces that would require monitors to access a network infrastructure (e.g., some tools require network synchronization [22]). It also supports most of the existing input formats (e.g., raw IEEE 802.11 frames, Prism, Radiotap, and AVS headers). Finally, it is usable in a straightforward fashion by just calling the adequate programs on trace files (other mergers require more complex setups, generally involving various servers [22;28;43]). This thesis motivates and describes the design of WiPal's trace merger:

- It proposes new algorithms for various stages of the merging process. In particular, the synchronization algorithm is a generalization of previous algorithms from the literature.
- It provides an analysis of the synchronization algorithm; we show that our algorithm is more accurate than previous algorithms.
- It provides a performance study that shows WiPal's merger is an order of magnitude faster than the other publicly available offline merger, namely Wit [43].

Our analyses rely on sixteen real traces from four distinct datasets (CRAWDAD's uw/sigcomm2004 [50], recorded during the SIGCOMM 2004 conference, and three private datasets we collected in various conditions). They allow us to calibrate various parameters of WiPal, validate its trace merger's operation, and show its efficiency. We believe that WiPal will be of great utility for the research community working on wireless network measurements.

1.3.2 Applying WiPal: empirical analyses

WiPal enables us to carry out analyses on datasets we collect using Wi-Fi sniffing. The second part of this thesis presents two of these analyses. The first one focuses on the accuracy of Wi-Fi sniffing. The second one studies Wi-Fi usage patterns in environments with different sociological meanings.

First, we collect short-lived (up to two hours) datasets using up to eight monitors sharing the same location. Analyzing these traces reveals that existing techniques to evaluate trace
completeness are inaccurate. Among other issues, we observe that a single buggy device can be enough to throw the whole system off. We also investigate how the number of sniffers impacts trace completeness. We show that even though individual sniffers may provide good accuracy, sometimes using eight sniffers is still not enough to capture all frames. Furthermore, the sniffing process exhibits a high level of randomness with variable accuracy.

Second, we record and analyze long-lived traces (three days and ten days long) obtained in three environments: an office, a dense uptown residential area, and a sparse suburban residential area. We focus on the behavior of the devices rather than on the traffic characteristics. We are interested in observations such as the total duration a device is active, the frequency of appearance of new devices, and the activity that can be extracted from traces. Among a number of results, we show that: (i) independently of the trace, most devices are inactive most of the time, (ii) due to mobility, two traces have a constant discovery rate of new users, even after days of measurements, and (iii) as the environments are part of users' lives over a typical day, activity intensity alternates between residential and office areas.

1.4 Outline

For the sake of clarity, we decided to split the description of related work: each chapter includes the part related to its concerns. There are four chapters divided into two parts: the first part focuses on WiPal while the second part presents empirical studies.

The first part features Chapters 2 and 3. Chapter 2 gives a general overview of WiPal. It presents its original features and design, and also includes a performance evaluation of this design. Chapter 3 focuses on WiPal's trace merging process. It presents some algorithms and optimizations that WiPal uses for this process. This includes an evaluation of WiPal's performance with regard to trace merging.

The second part presents the empirical analyses we performed using WiPal. It features Chapters 4 and 5. Chapter 4 is a study of Wi-Fi sniffing accuracy. Chapter 5 studies Wi-Fi usage patterns in various environments. Finally, appendices include WiPal's manual and a list of references.
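
The pipe and filter mechanism and the static callback mechanism mentioned in Section 1.3.1 can be illustrated with a short C++ sketch. It is only an illustration under stated assumptions and does not reproduce WiPal's actual API, which this chapter does not describe: the names `Frame`, `VectorSource`, `Filter`, `make_filter`, and `for_each_frame` are hypothetical. Each stage exposes a `next()` operation, stages are chained by nesting, and predicates and callbacks are passed as template parameters so that calls are resolved at compile time rather than through virtual functions.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for one captured frame: a timestamp and raw bytes.
struct Frame {
    std::uint64_t timestamp_us;
    std::vector<std::uint8_t> bytes;
};

// Source stage: yields frames one by one from memory.
// A real source would read them from a pcap file instead.
class VectorSource {
public:
    explicit VectorSource(std::vector<Frame> frames) : frames_(std::move(frames)) {}

    // Copies the next frame into `out`; returns false once the source is exhausted.
    bool next(Frame& out) {
        if (pos_ == frames_.size()) return false;
        out = frames_[pos_++];
        return true;
    }

private:
    std::vector<Frame> frames_;
    std::size_t pos_ = 0;
};

// Filter stage: pulls frames from an upstream stage and keeps those accepted by `Pred`.
// Upstream and Pred are template parameters, so both calls are bound statically.
template <typename Upstream, typename Pred>
class Filter {
public:
    Filter(Upstream up, Pred pred) : up_(std::move(up)), pred_(std::move(pred)) {}

    bool next(Frame& out) {
        while (up_.next(out))
            if (pred_(out)) return true;
        return false;
    }

private:
    Upstream up_;
    Pred pred_;
};

// Helper that deduces the template arguments of Filter.
template <typename Upstream, typename Pred>
Filter<Upstream, Pred> make_filter(Upstream up, Pred pred) {
    return Filter<Upstream, Pred>(std::move(up), std::move(pred));
}

// Sink: drains a pipeline, calls `callback` on every frame, and returns the frame count.
template <typename Stage, typename Callback>
std::size_t for_each_frame(Stage stage, Callback callback) {
    Frame f{};
    std::size_t n = 0;
    while (stage.next(f)) {
        callback(f);
        ++n;
    }
    return n;
}
```

A pipeline is then built by nesting stages, for example `make_filter(make_filter(VectorSource(frames), p1), p2)`, and consumed with `for_each_frame`. Because every stage type is known at compile time, the compiler can inline the whole chain, which is one plausible way for a generic design to stay close to the speed of specialized code.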

Part I. The WiPal trace manipulation framework


Chapter 2. WiPal: overview and design

WiPal emerged from our need to manipulate Wi-Fi packet traces. At its creation, existing tools lacked features regarding some operations (e.g., merging or extracting statistics). We therefore developed WiPal to fulfill these needs. WiPal has a number of features, but its most significant one is certainly trace merging, and it includes some original algorithms in this regard (Chapter 3 details WiPal's merging algorithms).

This chapter reports on our experience designing the WiPal framework: we draw attention to WiPal's design and to some of its features, which are original and might benefit other software developers faced with similar issues. WiPal is free software, available at [...]. Appendix B includes WiPal's manual.

In the following, Section 2.1 first gives a short overview of existing software for packet trace manipulation. Then Section 2.2 gives an overview of WiPal's design and features. The two subsequent sections focus on WiPal's modules: Section 2.3 addresses packet parsing, while Section 2.4 describes WiPal's pipe and filter mechanisms. Finally, Section 2.5 evaluates WiPal's performance.

2.1 Trace manipulation tools: related work

Packet traces are lists of network packets, either synthetic or, more commonly, acquired by tapping a network medium. They are involved in a significant part of network operations: administrators use them for monitoring and troubleshooting, researchers use them in measurements, simulations, or validations. As a consequence, many tools exist for their creation and manipulation (this includes, for instance, visualization or filtering) [29]. A common format is also prevalent for packet trace operations: pcap (packet capture) [7].

However, most packet trace processing tools are designed for a very specific goal, and carry their own packet-processing code. For instance, tcpdump understands many network protocols, but its parsing code cannot be used for any purpose other than displaying packets on a terminal [8]. Many tools rely on libpcap, but this library focuses on capturing packets
24 Overview of WiPal and does not export parsing or processing capabilities [7]. Wireshark is more modular, but still mostly visualization-oriented [6]. All these programs have a good design, and they are very efficient regarding their focus, but each time one creates another tool to handle packet traces, it is impractical to rely on previous code. Scapy is a notable exception [5]. It is an interactive packet manipulation program written in Python, able to parse many network protocols, providing features to read and write trace files, or interact with the network. Nevertheless, the scripting-nature of Scapy makes it dedicated to prototyping, experiments, short programs, or at least programs where performance is not an issue. Scapy s features also stop at the packet level; they do not provide trace-processing algorithms. Another interresting software is binpac [47]. binpac is a parser generator similar to Yacc [40] but it focuses on network protocols. binpac is efficient but only handle unicast streams. It is therefore not suited for sniffer traces. We designed WiPal to solve these issues regarding reusability and performance. Although WiPal focuses on Wi-Fi traces, it provides several protocol-agnostic features. 2.2 Overview of WiPal At the very beginning, WiPal started as a limited C++ library to parse IEEE frames. The library then grew with the applications using it. Due to our focus on reusability, we designed these applications as shells around WiPal s features. Each new feature was therefore part of WiPal rather than of a specific program. Eventually, applications relied so much on WiPal and had so few features of their own that they were merged into WiPal. Our need for solutions regarding all aspects of packet trace processing made WiPal a consistent generic library rather than a patchwork of specific features. Another critical aspect of WiPal is performance. 
In the literature, some libraries exist to help users parse various network protocols, but they are only available for scripting languages (e.g., Scapy [5]). While this is especially well suited to quick prototyping and experimentation, it is impractical for handling large traces (especially for heavy computations). WiPal's most complex utilities can process gigabyte-long traces in minutes. WiPal uses C++ because of this combined importance of reusability, performance, and genericity.

2.2.1 Features

WiPal comes both as a library and as a set of binaries (programs). Binaries provide a quick and simple interface for high-level features, but these features are also available as library services. Low-level features, though, are only available through the library. As an example, one does not need to write a program to merge several trace files. The following command does the job:

    $ wipal-merge t1.pcap t2.pcap [t3.pcap...]

High-level features include trace synchronization (using the wipal-synchronize binary), trace merging (wipal-merge), statistics extraction (wipal-stats), trace anonymization (wipal-anonymize), and various minor utilities such as comparison, concatenation, or hexadecimal dumping (wipal-cmp or wipal-cat, to name a few). The most important low-level features are pcap file I/O, IEEE 802.11 parsing, and support for other IEEE 802.11-related protocols.

Note that wipal-merge's code is just a shell around the library features. As of this writing, WiPal binaries have an average length of 122 lines of C++ (the whole of WiPal, including the library, is about 20k lines of code). The smallest binary is 44 lines of code, and the biggest 267. Most of this code is boilerplate due to specific C++ programming techniques.

On the other hand, performing a specific task using WiPal's parser, or combining several treatments in one executable file, requires users to write their own programs using the WiPal library. Listing 2.1 shows a sample program using this library. This program has few features, but other snippets will extend it in the following sections.

     1  #include <wipal/pcap/stream.hh>
     2  #include <wipal/wifi/frame.hh>
     3
     4  using namespace wpl;
     5
     6  int main()
     7  {
     8    pcap::file<> f ("file.pcap");
     9
    10    for (pcap::file<>::iterator i = f.begin(); i != f.end(); ++i)
    11      std::cout << wifi::type::names[wifi::type_of(i->bytes())] << std::endl;
    12  }

Listing 2.1: A sample program using WiPal. This program prints the type of every IEEE 802.11 frame included in file.pcap.

2.2.2 Overall architecture

Figure 2.1 presents a simplification of WiPal's structure. Binaries (on top) rely on the library, and the library itself relies on other external libraries. The WiPal library is also composed of several modules. We classify each module into one of three categories: base, protocols and file formats, or filters.

Base. These modules provide simple and common features, unrelated to WiPal's specific domain. For instance, they include various exceptions for error handling, generic abstract classes, and static programming helpers. We kept this layer as thin as possible thanks to external libraries such as Boost [1] or GNU MP [3].

Figure 2.1: WiPal's structure and modules.

Protocols and file formats. These modules are domain-specific and provide the basis for high-level processing. They feature abstractions such as IEEE 802 addresses, pcap traces, and protocol headers.

Filters. One may view a packet trace as a packet stream. Most algorithms just read this stream linearly, one packet after another, from beginning to end. This mode of operation particularly suits a pipe and filter pattern [17]. Therefore, we base WiPal's high-level modules on this design. WiPal provides pipe input and output through iterators [32]. Instantiating a filter object requires one or several iterators as input, and each object provides an output iterator. For instance, an anonymizer filter reads packets and outputs anonymized packets. A merge filter requires two packet streams as input but produces one output stream.

Some processing needs adaptation to this pattern. For instance, simultaneously synchronizing and merging two IEEE 802.11 packet traces is a complex operation [22; 44; 59]. Implementing it in WiPal means decomposing it into several filters and then wiring these filters in a specific way (see Figure 2.2). Every algorithm that needs to access a packet trace non-linearly needs such an adaptation.

Figure 2.2: A complex filter example. This figure shows how WiPal uses filters to synchronize and merge two IEEE 802.11 traces. Each box represents a filter and arrows show pipes. Pipes convey different types of data.

The base modules, as well as some of the protocols and file formats modules, form the lower modules of the WiPal library. On the other hand, filters form the higher modules.

2.3 Packet parsing

Although network packets often use a binary format, parsing them is not always straightforward. This is the case even when considering only Wi-Fi packet traces. Furthermore, distinct traces may involve distinct formats in addition to IEEE 802.11 (see Section 2.3.1 below). IEEE 802.11 packets may have several types and subtypes, and each type/subtype pair yields a distinct format (although all formats share some similarities). A well-crafted program should handle as many formats as possible, and handle each field properly according to its type. Implementing a new format should not require modifying existing code. It should be possible to perform various kinds of processing on the same frame without modifying the frame parser. In the following, we describe the mechanisms in WiPal that enable users to write such programs.

2.3.1 PHY headers

IEEE 802.11 packet traces often include extra information about the physical parameters of transmissions (e.g., frequency, signal-to-interference ratio, or precise timestamp). This information is available as an extra packet header inserted by the operating system for each frame. We call this header a PHY header. A pcap file is a succession of chunks, each chunk containing a pcap header and a byte sequence corresponding to a packet. PHY headers are located at the beginning of this byte sequence, between the pcap header and the IEEE 802.11 header. In packet traces that do not include PHY headers, an IEEE 802.11 header appears directly after each pcap header.

There is no reference format for PHY headers: hardware vendors introduced them independently of any standardization process. Open source developers push the Radiotap format [4] as a de facto standard, but many traces are already available in other formats (e.g., AVS or Prism), and some network drivers do not support Radiotap. Furthermore, some developers are reluctant to use it due to its variable-length headers. As a consequence, interoperability between IEEE 802.11 tools and packet traces is problematic. Most researchers develop their tools for the specific PHY headers they use, and different tools might not be able to process the same traces. Sometimes the features provided by two PHY formats are not even compatible!

     6  template <class PHY>
     7  void
     8  print(pcap::file<>& f)
     9  {
    10    for (pcap::file<>::iterator i = f.begin(); i != f.end(); ++i)
    11      {
    12        const PHY* phy = static_cast<const PHY*> (i->bytes());
    13        const void* ieee80211 = phy->decapsulate(i->meta().caplen, i->swapped());
    14
    15        std::cout << wifi::type::names[wifi::type_of(ieee80211)] << std::endl;
    16      }
    17  }
    18
    19  int main()
    20  {
    21    pcap::file<> f ("file.pcap");
    22
    23    switch (f.type())
    24      {
    25      case pkt::ieee802_11:           print<phy::empty_header<> >(f); break;
    26      case pkt::ieee802_11_radio:     print<rtap::header>(f); break;
    27      case pkt::ieee802_11_radio_avs: print<avs::header>(f); break;
    28      case pkt::prism_header:         print<prism::header>(f); break;
    29      }
    30  }

Listing 2.2: The program of Listing 2.1 with support for multiple PHY headers.

WiPal solves this issue using a proper abstraction for PHY headers (see Listing 2.2). WiPal users can handle any PHY header using the same consistent API (Application Programming Interface), as shown in the print() function of Listing 2.2. Note that users need to test the format of the trace file to set up the proper PHY header type. This is the purpose of the switch statement at line 23. The reason is that each PHY header's C++ type is part of a static class hierarchy [16], thus no dynamic method resolution is possible. We wanted to avoid dynamic resolution because a trace file may contain several hundred million packets, and we wanted to minimize the number of dynamic method calls for each packet (for the sake of performance). WiPal binaries factorize case statements and avoid this redundancy using the Boost preprocessor library [1].
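The static dispatch discussed above can be illustrated without WiPal. The following is a minimal, self-contained sketch and not WiPal's implementation: the types `empty_header` and `length_prefixed_header`, the `link_type` enumeration, and the `count_decodable` functions are hypothetical names introduced for illustration only. Each header type exposes the same `decapsulate()` interface, and a single run-time switch selects the template instantiation once per trace, so the per-packet loop contains no virtual call. The length-prefixed layout assumed here (a 16-bit little-endian total length at byte offset 2) is the one Radiotap uses for its header length.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::vector<std::uint8_t> packet;

// Hypothetical PHY header types; both expose the same static interface.
struct empty_header
{
  // No PHY header: the IEEE 802.11 frame starts immediately.
  static const std::uint8_t* decapsulate(const std::uint8_t* bytes, std::size_t)
  {
    return bytes;
  }
};

struct length_prefixed_header
{
  // Total header length stored as a little-endian 16-bit value at offset 2.
  // Returns a null pointer when the capture is too short (truncated packet).
  static const std::uint8_t* decapsulate(const std::uint8_t* bytes, std::size_t caplen)
  {
    if (caplen < 4)
      return 0;
    const std::size_t len = bytes[2] | (std::size_t(bytes[3]) << 8);
    return len <= caplen ? bytes + len : 0;
  }
};

enum link_type { plain_80211, radio_80211 };

// The per-packet loop is instantiated once per PHY type: calls are resolved
// at compile time and can be inlined.
template <class PHY>
std::size_t count_decodable_as(const std::vector<packet>& trace)
{
  std::size_t n = 0;
  for (std::size_t i = 0; i < trace.size(); ++i)
    if (PHY::decapsulate(trace[i].data(), trace[i].size()))
      ++n;
  return n;
}

// The only run-time decision, taken once per trace rather than per packet.
std::size_t count_decodable(link_type t, const std::vector<packet>& trace)
{
  switch (t)
    {
    case plain_80211: return count_decodable_as<empty_header>(trace);
    case radio_80211: return count_decodable_as<length_prefixed_header>(trace);
    }
  assert(false && "unknown link type");
  return 0;
}
```

The design trade-off is the one the text describes: every supported format must be listed in the switch, but the compiler sees the concrete header type inside the loop.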

    const uint8_t* offset = static_cast<const uint8_t*>(frame) + 30;
    uint16_t eth_type  = tool::extract_big_endian_short_u(offset);
    uint8_t  ip6nxthdr = offset[8];
    uint8_t  icmp6type = offset[42];
    uint16_t udp6port  = tool::extract_big_endian_short_u(offset + 42);

    if (eth_type == 0x86dd and ip6nxthdr == 0x11 and udp6port == 698)
      // ...
    else if (eth_type == 0x86dd and ip6nxthdr == 0x3a and icmp6type == 0x86)
      // ...

Listing 2.3: A typical example of packet processing code. The code is error-prone, depends on the whole protocol stack, and does not handle truncated frames.

2.3.2 IEEE 802.11 parsing

Several practices exist regarding IEEE 802.11 frame parsing. A common malpractice among researchers is to feed traces to a program such as Wireshark or tcpdump, which parses frames and outputs human-readable text, and then to use a scripting language such as Perl to re-process this output. This should be avoided for three reasons:

1. It processes each frame twice. One of the processing steps is done by a scripting language and involves regular expressions. This is inefficient for parsing a binary format.

2. Script code that involves regular expressions is error-prone and more difficult to maintain. In this case, the code also depends on the specific version of Wireshark or tcpdump used, as their outputs may change between versions. This imposes an extra maintenance burden.

3. This often results in overspecialized code. A change in the sequence of protocols each packet uses might break the whole program.

Another practice consists in focusing on the specific bytes of interest in each frame. This often produces error-prone and hard-to-maintain code. Listing 2.3 is a typical example of such code. Who could tell that it checks for pseudo-Ethernet frames that include either OLSR messages in IPv6 UDP packets, or ICMPv6 router advertisements (see footnote 2)?
Even with proper comments and constants, one would need to be very careful about the offsets and the protocols under test (and we do not even mention handling truncated frames). Another problem with this technique is that it is specific to a given problem per se: change only one layer of the protocol stack, and the whole code must be rewritten.

Footnote 2: OLSR (Optimized Link State Routing) is a routing protocol. UDP (User Datagram Protocol) is a transport-level protocol. ICMP (Internet Control Message Protocol) is a protocol from the Internet protocol suite. IPv6 and ICMPv6 refer to the latest version of the Internet Protocol, which succeeds IPv4 and ICMPv4.

Wireshark adopts a valid approach. Its frame parsing component generates a syntax tree, and one can access each of the frame's fields using a consistent API. We believe, however, that this approach is overkill for WiPal. Many algorithms only focus on a few fields inside each frame, so there is no need to spend resources on allocating and constructing a whole structure. Furthermore, handling each frame element would be a waste of time, e.g., when one needs only two of them.

Instead of generating a syntax tree, WiPal's parser calls user-given callback functions at various stages of its processing (for instance, for each address field or each time a sequence control field is encountered). When retrieving a specific field is unnecessary, the user just provides an empty callback (WiPal actually provides empty callbacks by default). Callbacks are static parameters to the parser, therefore compiler optimizations (function inlining and dead code elimination) ensure efficiency.

     6  struct hooks: public wifi::dissector_default_hooks
     7  {
     8    void seq_ctl_hook(const void* ieee80211,
     9                      size_t ieee80211_len,
    10                      unsigned fragno,
    11                      unsigned seqno)
    12    {
    13      std::cout << seqno << '/' << fragno << std::endl;
    14    }
    15  };
    16
    17  template <class PHY>
    18  void
    19  print(pcap::file<>& f)
    20  {
    21    for (pcap::file<>::iterator i = f.begin(); i != f.end(); ++i)
    22      wifi::dissect<PHY, hooks>(*i);
    23  }

Listing 2.4: A program using WiPal's IEEE 802.11 parser. It uses the same main() function as Listing 2.2.

Listing 2.4 shows an example. The print() function of Listing 2.2 now parses the frame using callbacks (hooks) defined at line 6. The parser calls the seq_ctl_hook() function each time it parses a sequence control field. Note that, although some frames may be truncated or may not include sequence numbers, this is transparent to the user. Also note that the user does not need to care about bit manipulations (inside IEEE 802.11 frames, sequence numbers and fragment numbers are respectively 12-bit-long and 4-bit-long fields embedded into a 16-bit-long word, using the byte order mandated by IEEE 802.11). Finally, it would also be possible to use WiPal's parser to build a syntax tree. To this end, one just needs to implement the suitable callbacks.
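To make the callback mechanism concrete, here is a minimal, self-contained sketch of a parser that takes its hooks as a template parameter. It is not WiPal's dissector: `default_hooks`, `recording_hooks`, and this `dissect()` signature are hypothetical and only handle the sequence control field. The sketch assumes the standard IEEE 802.11 layout for management and data frames (sequence control at byte offset 22, little-endian, with the fragment number in the low 4 bits) and silently skips control frames and truncated frames, which is the transparency the text describes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Default hooks do nothing; once inlined, an empty call costs nothing.
struct default_hooks
{
  void seq_ctl_hook(unsigned /* fragno */, unsigned /* seqno */) {}
};

// Hypothetical user hooks that remember the last sequence control field.
struct recording_hooks: default_hooks
{
  recording_hooks(): calls(0), fragno(0), seqno(0) {}

  void seq_ctl_hook(unsigned f, unsigned s)
  {
    ++calls;
    fragno = f;
    seqno = s;
  }

  unsigned calls, fragno, seqno;
};

// Minimal dissector: the hook type is a static (template) parameter, so the
// call below is resolved at compile time.
template <class Hooks>
void dissect(const std::uint8_t* frame, std::size_t len, Hooks& h)
{
  assert(frame != 0 || len == 0);
  if (len < 2)
    return;                                   // truncated before frame control
  const unsigned type = (frame[0] >> 2) & 0x3;
  if (type == 1)
    return;                                   // control frames have no sequence control
  if (len < 24)
    return;                                   // truncated before sequence control
  const unsigned word = frame[22] | (unsigned(frame[23]) << 8);
  h.seq_ctl_hook(word & 0xf, word >> 4);      // low 4 bits: fragment, high 12: sequence
}
```

A user who needs no field at all passes `default_hooks`, in which case an optimizing compiler is free to remove the hook call entirely.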

2.4 Filters

WiPal processes packet traces using a pipe and filter pattern [17]. Iterators provide pipe input and output [32]. A filter is an object that takes iterators as input and makes an output iterator available. This section presents the benefits of using this pattern for packet processing. Trace files using the pcap file format provide the base iterators for filters. Therefore, this section also presents the abstractions that provide basic iterators from pcap files, although they are considered lower modules of the library.

2.4.1 Filter sources: pcap abstractions

pcap is the de facto standard for handling packet traces [7]. The format is both simple to read and simple to write, and may handle any type of packet trace, which explains its wide acceptance. Although some other formats exist (e.g., the formats used by Cheng et al. [22] or VeriWave [57]), WiPal does not implement them for now as they are still barely used. One could, however, implement them with only minor intrusions into WiPal's code. It is also important to underline that tools exist to convert other formats to pcap.

WiPal provides several abstractions for reading and writing pcap files. The following paragraphs elaborate on three original features of WiPal's pcap system: (i) random access to a pcap file, (ii) the ability to aggregate several files as one pcap stream, and (iii) the ability to attach meta-data to a pcap stream.

Random access to a pcap file

A basic usage for a pcap stream is to retrieve an iterator pointing to the stream's first packet. Incrementing this iterator then enables the user to traverse the packet stream. But WiPal also features another access mode: one can retrieve iterators pointing to arbitrary packets in constant time. Random access to a packet is useful to focus efficiently on a specific portion of a trace. Here is an example. When opening a pcap trace, standard trace visualizers start by loading the whole trace into memory.
Browsing the list of packets then just requires memory accesses. This works for traces of reasonable size, but traces used in network research are frequently several gigabytes long [50]. Such traces cannot be loaded into memory. For instance, Wireshark on a GNU/Linux machine with 2 GB of RAM is unable to load traces of more than 500 MB. A solution would be to load into memory only the part of the trace the user is displaying at a given time. But as the user moves inside a trace, the program must be able to quickly load the correct part. For performance reasons, it is not possible to re-traverse the whole trace each time. Hence the need to access, in constant time, any specific packet inside the trace.

WiPal achieves constant-time access to random packets using file indexes. When opening a pcap file, WiPal performs a single file traversal and records its position in the file every K packets (K is a customizable parameter). When asked for an iterator to the p-th packet, it seeks to recorded position number ⌊p/K⌋ and then traverses p mod K packets. Since seek() is constant time, and at most K read operations are required (K being fixed and independent of the trace file), random access is O(K) = O(1). The smaller K is, the faster the operation, but also the larger the index's memory footprint. Also note that building the index requires a full trace traversal. Therefore, this indexing mechanism is optional, and users can disable it when they do not need random access.

Figure 2.3: A screenshot of WScout [24]. WScout uses WiPal's random access feature to open packet traces that do not fit in memory.

As a proof of concept, we developed a trace visualizer using this feature: WScout [24]. Figure 2.3 displays a screenshot. Thanks to WiPal's random access feature, WScout is able to display in a graphical interface packet traces too large to fit into memory. To the best of our knowledge, WScout is the only visualizer with such a feature. It is available as free software.
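The indexing scheme described above can be sketched in a few lines. The following is an illustration under simplifying assumptions, not WiPal's code: the class name `sparse_index` is hypothetical, and records are newline-terminated strings in a `std::istream` rather than pcap chunks, which keeps the example self-contained. The principle is the same: one traversal records the offset of every K-th record, and a lookup performs one seek followed by at most K reads.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sparse index over a stream of newline-terminated records.
class sparse_index
{
public:
  // Single traversal: remember where every k-th record starts.
  sparse_index(std::istream& in, std::size_t k): in_(in), k_(k), count_(0)
  {
    assert(k_ > 0);
    std::string record;
    in_.clear();
    in_.seekg(0);
    for (;;)
      {
        const std::streampos pos = in_.tellg();
        if (!std::getline(in_, record))
          break;
        if (count_ % k_ == 0)
          offsets_.push_back(pos);
        ++count_;
      }
  }

  std::size_t size() const { return count_; }

  // Returns record number p (0-based): one seek, then p mod k + 1 reads.
  std::string at(std::size_t p)
  {
    assert(p < count_);
    in_.clear();                       // leave any end-of-file state
    in_.seekg(offsets_[p / k_]);
    std::string record;
    for (std::size_t i = 0; i <= p % k_; ++i)
      std::getline(in_, record);
    return record;
  }

private:
  std::istream& in_;
  std::size_t k_;
  std::size_t count_;
  std::vector<std::streampos> offsets_;
};
```

With N records, the index holds about N/K offsets, which is the memory versus speed trade-off mentioned in the text.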

pcap file aggregation

A common practice when capturing packets is to split the resulting packet stream into multiple files. Some tools require it in order to generate long traces (e.g., more than 2 GB). Crawdad's uw/sigcomm2004 dataset includes such traces [50]. To later process these traces, one must consider the concatenation of the trace files as one unique packet stream. Despite looking like a minor issue, this is an annoying burden for developers: one would like to focus on the processing logic rather than working around measurement quirks.

WiPal enables users to consider multiple pcap files as one single pcap stream. Adding this feature to a program is as simple as replacing every occurrence of pcap::file<> with pcap::list<> (e.g., in the preceding code snippets). One can then use a specific syntax to aggregate files. For instance, opening "file1.pcap:file2.pcap" will generate a stream that outputs packets from file1.pcap first, then from file2.pcap. Note how this operation is transparent to end-users. Other services are also available, for instance to check the consistency of the list (e.g., to check that every file in the list uses the same PHY format).

Packet stream meta-data

Trace files are often associated with information they do not include directly. A common example is the IP or MAC address of the machine that generated the trace file (i.e., that performed the packet capture). Such information can be useful, for instance, if this machine injected packets during the capture and one needs to filter these packets out when processing the trace (e.g., because their timestamps are less accurate). A common practice is to embed these pieces of information into the trace file names. Some tools require users to arrange trace files according to a specific filesystem tree. In order to ease the programming of such mechanisms, every packet stream in WiPal can embed meta-data.
Stream meta-data in WiPal takes the form of a mapping from a string to an object of any type. Users can therefore attach any needed piece of information to a packet stream. pcap lists use this mechanism. For instance, when opening "file.pcap= " or "foo.pcap:bar.pcap= ", WiPal associates the given IP address to the corresponding stream, under the string addr. WiPal's trace merging services use such information.

2.4.2 Processing filters

Filters are the core of WiPal's advanced processing features. WiPal features a dozen filters related to trace merging, synchronization, or anonymization. This section illustrates with a simple example how they can improve code quality when dealing with packet traces.

The example is a program that anonymizes a packet trace and then prints statistics concerning the resulting trace. This needs two filters and two data sinks. The filters are an anonymizer and a timetracker, and the sinks are a pcap output stream (for the anonymized trace) and a statistics extraction module. The anonymizer filter reads IEEE 802.11 frames and outputs a copy of these frames truncated at the end of the MAC layer, in which MAC addresses and ESSIDs (network identifiers; see footnote 3) have been replaced with random values. The timetracker filter is in charge of extracting precise timestamps from PHY headers (it falls back to pcap timestamps when there are no PHY headers). It also handles wraparounds (some timestamp formats wrap roughly every hour and a half), so it produces monotonically increasing timestamps. Having timestamps that are as precise as possible is necessary in order to compute statistics about the packet stream.

Figure 2.4 shows how to connect each element. The input file links to the anonymizer, and the anonymizer to the timetracker. We then send the timetracker output to both the output stream and the statistics module.

Figure 2.4: A simple processing pipeline using two filters (represented as white boxes). Listing 2.5 displays the code implementing this pipeline.

Listing 2.5 implements this program. One can distinguish three parts: type definitions (lines 10-12), object declarations (lines 14-17), and processing (lines 19-24). Inside WiPal, every filter is a class, therefore the type definitions set up two type aliases for the filter classes: anonymizer and timetracker. One can notice that the C++ types for filters embed the type of their input iterators. This is a drawback of using static C++: when using many filters, one starts with a long list of typedefs, and this requires the user to juggle type names. It is however important to note that these static mechanisms enable compilers to perform optimizations and produce efficient code. This is the key to WiPal's performance. Furthermore, type checking ensures correctness and safety, as the compiler does not let users make mistakes in this part of the code.
Finally, C++0x, the next release of the C++ standard, to be published soon [18], will solve this problem: it includes features that make writing these type definitions unnecessary (thanks to the auto keyword).

Footnote 3: ESSID stands for Extended Service Set Identifier.

     6  template <class PHY>
     7  void
     8  process(pcap::file<>& f)
     9  {
    10    typedef filter::anonymizer<pcap::file<>::iterator, PHY> anonymizer;
    11    typedef filter::timetracker<typename anonymizer::iterator, PHY> timetracker;
    12    typedef typename timetracker::iterator iterator;
    13
    14    anonymizer a (f.begin(), f.end());
    15    timetracker t (a.begin(), a.end());
    16    pcap::ostream o ("output.pcap", f);
    17    wifi::stats::stats s;
    18
    19    for (iterator i = t.begin(); i != t.end(); ++i)
    20      {
    21        o << *i;
    22        s.account<PHY>(*i);
    23      }
    24    std::cout << s << std::endl;
    25  }

Listing 2.5: An example of advanced trace processing using filters. This program uses the same main() function as Listing 2.2.

The next part of Listing 2.5 (lines 14-17) declares the filter objects and end-modules. Connecting the filters is achieved by giving the proper iterators to the filters' constructors (lines 14-15). Note that when the program runs, at this stage, no processing has started. Filters operate in a lazy fashion: the input file will not be read until we start reading the timetracker's output. Furthermore, filters only load into memory the data they need for producing their next element, nothing more.

Finally, the last part of Listing 2.5 (lines 19-24) reads each output frame from the timetracker and sends it to the end-modules. Sending packets to a pcap::ostream object using the << operator transparently writes the corresponding pcap file. The following method call to account() updates the statistics module. It is then possible to report these statistics on the standard output using standard C++ streams and formatting operators. These statistics include frame counts and traffic rates for each frame type/subtype, estimations of missed frames, lists of networks and cells, information about transmitters, and various other figures.

There are two important points about this program. First, it is very easy to alter its behavior by just adding or removing the desired filters. For instance, one could add a filter before the anonymizer to filter out a certain type of packets. Or the anonymizer could be removed, thus making the program a simple statistics extractor. The second point is that filters are an easy mechanism to parameterize a processing task.
Some processing tasks have parts that can be implemented with different algorithms (for instance, a merge process might use several synchronization algorithms). In such cases, testing various algorithms is as simple as changing the corresponding filter, without altering the rest of the pipeline.

In other words, filters enable decomposing trace operations into several basic blocks, thus making trace processing modular. As a consequence, we can expect programs to be easier to maintain and adapt, and code to be easier to reuse.
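As an illustration of the pattern, the following self-contained sketch shows a lazy filter built on iterators. It is not one of WiPal's filters: `unwrap_filter` and `collect` are hypothetical names, and the input is reduced to a sequence of 32-bit timestamps. The filter plays a role similar to the wraparound handling attributed to the timetracker above: it turns a wrapping 32-bit counter into monotonically increasing 64-bit values, under the assumption that two consecutive timestamps are never more than one wrap period apart. Nothing is read from the input until the output iterator is dereferenced or incremented.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical lazy filter over forward iterators yielding 32-bit timestamps.
template <class ForwardIt>
class unwrap_filter
{
public:
  class iterator
  {
  public:
    iterator(ForwardIt cur, ForwardIt last): cur_(cur), last_(last), high_(0) {}

    // Computed on demand: the input is only read when the caller asks.
    std::uint64_t operator*() const
    {
      assert(cur_ != last_);
      return high_ + *cur_;
    }

    iterator& operator++()
    {
      const std::uint32_t prev = *cur_;
      ++cur_;
      if (cur_ != last_ && *cur_ < prev)
        high_ += std::uint64_t(1) << 32;   // the 32-bit counter wrapped
      return *this;
    }

    bool operator==(const iterator& rhs) const { return cur_ == rhs.cur_; }
    bool operator!=(const iterator& rhs) const { return cur_ != rhs.cur_; }

  private:
    ForwardIt cur_, last_;
    std::uint64_t high_;
  };

  // Construction only stores the input iterators; no element is processed.
  unwrap_filter(ForwardIt first, ForwardIt last): first_(first), last_(last) {}

  iterator begin() const { return iterator(first_, last_); }
  iterator end() const { return iterator(last_, last_); }

private:
  ForwardIt first_, last_;
};

// Drains a filter into a vector, as a data sink would.
template <class Filter>
std::vector<std::uint64_t> collect(const Filter& f)
{
  std::vector<std::uint64_t> out;
  for (typename Filter::iterator i = f.begin(); i != f.end(); ++i)
    out.push_back(*i);
  return out;
}
```

Because the filter exposes begin() and end(), another filter could take its iterators as input, which is how the pipeline of Listing 2.5 chains the anonymizer and the timetracker.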

2.5 Performance evaluation

We evaluate WiPal's efficiency using nine test programs involving WiPal and some well-known packet processing software. We are interested both in how WiPal performs with regard to other programs and in the overhead generated by WiPal's original features (namely trace aggregation, random packet access, the IEEE 802.11 parser, and filters).

2.5.1 Methodology

We use nine test programs. Here is a short description of each of them.

libpcap. This is a simple program that uses libpcap [7] to perform a single pcap file traversal. Packets are discarded immediately after being read from the file. We use this program as a reference.

WiPal-file. This is the same program as above, using WiPal instead of libpcap. It uses pcap::file objects and its code is very similar to Listing 2.1. The goal is to compare WiPal's pcap reading mechanisms to libpcap's.

WiPal-list. This program is the same as WiPal-file, using pcap::list objects instead of pcap::file (see Section 2.4.1). We use this test to measure the overhead of WiPal's file aggregation feature.

WiPal-parser. This program performs a single file traversal, calling WiPal's IEEE 802.11 parser on all frames composing the trace. We use the default behavior of WiPal's parser, which is to call empty callbacks. This allows us to measure the overhead of an empty parser. In the ideal theoretical case, the C++ compiler would optimize the code out and the program would exhibit performance similar to WiPal-list. We also compare this program to Scapy (see below), which performs basically the same task.

WiPal-random. This program tests WiPal's random access feature (see Section 2.4.1). It starts by building an index of its input file, then performs successive accesses to random packets. The number of random accesses is twice the input trace's packet count. Therefore, it does the equivalent of three file traversals: one using standard iteration mechanisms and two using random accesses.
If one subtracts from its execution time the time for building the index (estimated with WiPal-file) and divides the result by two, one obtains the average time of a single random traversal. One can use this result to compute the overhead of random access over conventional iteration. In this program we use K = 4. This value ensures fast random access while keeping a reasonable memory footprint. Measurements show that WScout [24], using WiPal's indexes with K = 4, is able to load a 22 GB trace (including about 108,000,000 packets) using a total of 560 MB of virtual memory.
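To give a rough sense of the index size behind these figures, the number of recorded positions follows directly from the packet count and K. The byte estimate below is only a sketch: it assumes one 8-byte file offset per index entry, which the text does not state, and the 560 MB figure covers the whole of WScout's virtual memory, not the index alone.

```latex
\[
  \frac{N}{K} = \frac{108\,000\,000}{4} = 27\,000\,000 \ \text{index entries},
  \qquad
  27\,000\,000 \times 8\ \text{B} = 216\ \text{MB (assumed entry size)}.
\]
```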

WiPal-filters. This is the program of Listing 2.5. The goal is to have an idea of how moderately complex processing performs using WiPal's filters (anonymization and statistics extraction running simultaneously). Of course, this is not directly comparable to tshark or tcpdump (see below) because each program implements different features. But we expect these programs to have execution times of the same order of magnitude.

Scapy. This program is very similar to WiPal-parser in its features. It uses Scapy's sniff function to read its input file. Scapy parses each packet it reads, but we set up the function to immediately discard the packet without further processing afterwards. We set up Scapy to parse only the MAC layer.

tshark. This is the plain tshark program, which is the console version of Wireshark. It reads the input file, parses each packet, and displays a text summary on standard output. tshark relies on libpcap for its I/O operations.

tcpdump. This is the plain tcpdump program, which basically offers the same features as tshark. Unlike Wireshark, it uses a custom parser dedicated to printing packet summaries on a terminal, and we expect it to be faster than Wireshark. tcpdump also relies on libpcap for its I/O operations.

In order to evaluate a program, we feed it a 460 MB pcap trace (including about 2,100,000 packets). We run each program a hundred times, measuring its execution time (accounting only for the user and system time as reported by the UNIX time command). We then compute the mean execution time and 95% confidence intervals. We always use the same trace file: (i) we do not expect another trace of similar size to lead to significant differences, and (ii) each of these programs is linear with respect to the trace size, so the average processing time per packet will not change with bigger or smaller traces.
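The text does not say how the confidence intervals were computed. As a hedged illustration, the sketch below uses one common choice: the sample mean with a half-width of 1.96 standard errors (normal approximation), which is reasonable for about a hundred runs. The names `interval` and `mean_ci95` are hypothetical and introduced only for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Mean and half-width of the confidence interval [mean - hw, mean + hw].
struct interval
{
  double mean;
  double half_width;
};

// 95% confidence interval using the normal approximation (z = 1.96).
// This is an assumption: a Student t quantile would be slightly wider.
interval mean_ci95(const std::vector<double>& samples)
{
  assert(samples.size() >= 2);
  const double n = static_cast<double>(samples.size());

  double sum = 0;
  for (std::size_t i = 0; i < samples.size(); ++i)
    sum += samples[i];
  const double mean = sum / n;

  double squares = 0;
  for (std::size_t i = 0; i < samples.size(); ++i)
    squares += (samples[i] - mean) * (samples[i] - mean);
  const double stddev = std::sqrt(squares / (n - 1));   // sample standard deviation

  interval result = { mean, 1.96 * stddev / std::sqrt(n) };
  return result;
}
```

One would feed this function the hundred measured execution times of a given test program.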
The trace comes from a real-world measurement and may be considered average-sized for measurements in wireless environments. In order to avoid disk slowdowns, we store this file in a RAM disk and we redirect all outputs to /dev/null. The machine executing these tests has a dual-core Intel Pentium D CPU at 3 GHz, with 2 GB of RAM and a 2 MB cache.

2.5.2 Results

Figure 2.5 displays the results. It is important to keep in mind that many of these programs do not have the exact same features as the others. Therefore, most of the time, one should not expect precise comparisons from these results. They rather give an idea of orders of magnitude for a typical trace. We can nevertheless draw a number of interesting conclusions.

Figure 2.5: Mean execution time for a hundred runs of the various test programs. Note that most 95% confidence intervals are too small to be distinguished clearly. Values shown: libpcap 830 ms; WiPal-file 950 ms; WiPal-list 950 ms; WiPal-parser 970 ms; WiPal-random 28 s; WiPal-filters 13 s; Scapy 70 min; tshark 70 s; tcpdump 12 s.

Comparison with libpcap. A first thing to notice is that WiPal's packet reading features perform almost as well as libpcap's (WiPal-file is 120 ms slower, for a total execution time of nearly a second). This extra delay is negligible: as shown by WiPal-filters or tcpdump, for more elaborate processing, the time actually spent performing I/O operations is small compared to the time spent performing other computations. The important point is that using iterators does not noticeably impact the performance of WiPal's C++ API. WiPal's I/O speed is comparable to libpcap's.

Overhead of WiPal-list and WiPal-parser. It is interesting to note that WiPal-list leads to the same execution time as WiPal-file and that WiPal-parser performs almost as well as the previous two (a couple dozen milliseconds slower). One can draw two conclusions: (i) the file aggregation feature has a negligible cost, and (ii) the generic parser implementation using static callbacks is efficient. This means that only user-provided callbacks might cause a noticeable overhead.

Overhead of random access. The WiPal-random program runs in 28 seconds. Thus, we can estimate that traversing the trace once in random order takes less than 14 seconds (remember that the WiPal-random program performs one sequential access and two random accesses per packet). This makes random access to a packet roughly 14 times slower than sequential access in practice. In theory, however, with K = 4, it should only be twice as slow on average. We believe the difference between theory and practice is due to the fact that a random file access at the standard library level breaks the underlying buffering mechanisms (whereas a sequential access does not).
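The estimate above can be written out explicitly with the measured values from Figure 2.5, using the WiPal-file time (0.95 s) as the cost of the sequential indexing pass, as the methodology suggests:

```latex
\[
  t_{\text{random}} \approx \frac{T_{\text{WiPal-random}} - T_{\text{WiPal-file}}}{2}
  = \frac{28\ \text{s} - 0.95\ \text{s}}{2} \approx 13.5\ \text{s},
  \qquad
  \frac{t_{\text{random}}}{T_{\text{WiPal-file}}} \approx \frac{13.5}{0.95} \approx 14.
\]
```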
As a conclusion, random access is significantly slower but is still reasonable with regard to the feature offered. This extra delay is of the same order of magnitude as that of other processing such as WiPal-filters or tcpdump.

Overhead of using filters for advanced trace processing. WiPal-filters runs in 13 seconds. This is about the same execution time as tcpdump, while tshark and Scapy are at least an

order of magnitude slower. Therefore, WiPal's design does not hinder its efficiency: by using filters, WiPal achieves performance levels that are similar to those of specialized programs. WiPal-filters uses two filter objects: an anonymizer and a timetracker. The anonymizer relies in part on WiPal's generic IEEE 802.11 parser and the timetracker uses PHY abstractions to extract timestamps from PHY headers. On the one hand, this means that WiPal's genericity does not prevent it from competing with specialized code. On the other hand, the extra genericity in tshark's design (compared to tcpdump) comes at the cost of reduced performance (tshark is about seven times slower than tcpdump or WiPal-filters).

Scapy is even slower, requiring more than one hour to process the trace. In this case, the primary cause is its implementation language (Python). Scripting languages are known to be slower than compiled ones, but they are also dynamic and therefore miss several optimization opportunities. For instance, Scapy cannot optimize the parser out, even though each packet is discarded, whereas this is possible with WiPal-parser, which provides the same features as Scapy.

As a conclusion for this section, WiPal does not sacrifice performance for reusability. It may even outperform existing state-of-the-art programs. Its I/O operations are almost as fast as libpcap's despite the extra features. It also has a generic design that can compete with specialized code. This is a strong argument for adopting WiPal instead of writing specific code when designing packet trace manipulation software.

2.6 Conclusion

This chapter presented WiPal, a packet manipulation framework, and reported on our experience designing it. To the best of our knowledge, WiPal is the only framework that focuses on both performance and genericity. This makes it a valuable tool for researchers who need to develop packet trace processing software.
Though WiPal mostly addresses the IEEE 802.11 protocol, it also provides several protocol-agnostic features (e.g., pcap I/O operations). Furthermore, WiPal uses patterns that could be useful to handle other types of packet traces and protocols.

WiPal introduces a number of original features and a novel design. It features trace anonymization, statistics extraction, synchronization, merging, as well as other miscellaneous operations. Instead of relying on syntax trees, its IEEE 802.11 frame parser uses a static callback mechanism. By applying modern compiler optimizations, we obtain generic and fast operations. WiPal features the ability to index pcap files, thus allowing random access to packets. These accesses are constant-time and incur only limited overhead. It is also possible to aggregate trace files and to consider the concatenation of several files as one unique packet stream. WiPal also includes a mechanism to attach meta-data to trace files, as they are often associated with data they do not include directly. Finally, WiPal's whole design is based on a pipe-and-filter pattern. This pattern enables decomposing trace operations into several basic blocks, thus making trace processing modular. The consequence is that

programs become easier to maintain and to adapt, and code easier to reuse. Measurements show that WiPal competes with state-of-the-art packet processing software. Its I/O operations are almost as fast as libpcap's, and its generic design is as fast as specialized code.

Chapter 3. WiPal: IEEE 802.11 trace merging

The most innovative part of WiPal is probably the one dedicated to merging IEEE 802.11 packet traces. This merger includes original algorithms and focuses on performance, ease of use, and flexibility. We achieve performance through a proper design and careful programming. Ease of use and flexibility, on the other hand, are the consequences of a number of characteristics that distinguish WiPal's trace merger from other software:

Offline operation. Because it is designed to run offline, WiPal is independent of the monitors. This means that one may use any software to acquire data. Most trace mergers expect monitors to embed specific software [22;28].

Independence of infrastructure. WiPal's internal algorithms do not expect features from traces that would require monitors to access a network infrastructure (e.g., loose sniffer synchronization using NTP, the network time protocol). Monitors just need to record data in a compatible input format.

Compliance with multiple formats. WiPal supports most of the existing input formats. Other trace mergers, on the other hand, require a specific format. Some tools even require a custom dedicated format [22].

Hands-on design. WiPal is usable in a straightforward fashion by just calling the adequate programs on trace files. Other mergers require more complex setups (e.g., a database server [43] or a network setup involving multiple servers [22]).

This chapter explains the design and internals of WiPal's merger. We also intend to complement existing papers in the literature and give additional insights into the complex process of trace merging. Section 3.1 first gives an overview of existing trace merging techniques. Like every other tool, WiPal uses these techniques. Then Section 3.2 explains the basics of WiPal's merging algorithms. Section 3.3 goes into more detail and presents each distinct part individually.
Finally, Section 3.4 provides an evaluation of WiPal's efficiency regarding trace merging.

3.1 Trace merging: state of the art

Wireless sniffing requires the use of multiple monitors for coverage and redundancy reasons. Coverage is an issue when the distance between the monitor and at least one of the transmitters to be sniffed is too large to guarantee a minimum reception threshold. Redundancy is the consequence of the unreliability of the wireless medium. Even in good radio conditions, monitors may miss successfully transmitted frames.

After the collection phase, traces must be combined into one. A merged trace holds all the frames recorded by the different monitors and gives a global view of the network traffic. The traditional approach to merging traces involves a synchronization step, which aligns frames according to their timestamps. This step includes identifying the frames that are identical in all traces so that they appear once and only once in the output trace (Cheng et al. [22] refer to this operation as unification). Figure 3.1 illustrates this process (more details are given in Section 3.3).

Figure 3.1: Merging two traces $T_1$ and $T_2$.
A. The traces are not synchronized and miss some frames.
B. One identifies some reference frames common to both traces. This information enables trace synchronization.
C. One adjusts the frames' timestamps and synchronizes $T_1$ and $T_2$.
D. One can merge the traces. Duplicate frames are only counted once.

Synchronization is difficult to obtain because, in order to be useful, it must be very precise. Imprecise frame timestamps may result in duplicate frames and incorrect ordering in the output trace. An invalid synchronization may also lead to distinct frames being counted as the same frame in the output trace. In order to avoid such undesirable effects, one needs

a precision of less than 106 µs [59]. To the best of our knowledge, only the VeriWave WaveTest appliances [57] are able to synchronize network cards' clocks with such precision (note that we are interested in frame arrival times in the card, not in the operating system). But this requires specific wiring between sniffers, and this hardware is expensive. Therefore, all merging tools post-process traces to resynchronize them with the help of reference frames, which are frames that appear in multiple traces. One may readjust the traces' timing information using the timestamps of the reference frames (see Figure 3.1). Finding reference frames is however a hard task, since we must be sure that a given reference frame is an occurrence of the same frame in every trace. In particular, some frames that occur frequently (e.g., MAC acknowledgments) cannot be used as reference frames because their content does not vary enough. Therefore, only a subset of frames is used as reference frames, as explained later in this chapter (cf. Section 3.3).

A few trace merging tools exist in the literature, but they do not focus on the same set of features as WiPal. For instance, Jigsaw is able to merge traces from hundreds of monitors, but requires monitors to access a network infrastructure [22]. This work, in contrast, considers smaller-scale systems (dozens of monitors) where no monitor can access a network infrastructure. WisMon is an online tool that has requirements similar to Jigsaw's [28]. CrunchXML [51] is a tool that uses the same merging algorithm as WisMon, but that can work either online or offline. However, due to this algorithm, its operation needs all sniffers to hear a common access point. In order to work in all kinds of environments, WiPal cannot make such an assumption (sometimes access points are not shared among all traces, or there are no access points). The system that is the closest to ours is Wit [43;44].
Although Wit provides valuable insights on how to develop a merging tool, it is difficult to use, modify, and extend in practice (cf. its authors' note in CRAWDAD [44]).¹ This explains in part our motivation to propose a new trace merger.

¹ Note that we refer to Wit's merging process, and not to the other features available (e.g., a module to infer missing packets).

3.2 WiPal's basics

WiPal has been designed according to the following constraints:

No wired connectivity. The sniffers must be able to work in environments where no wired connectivity is provided. The idea is to be able to perform measurements when it is difficult to have all sniffers access a shared network infrastructure (e.g., in some conference venues, or when studying interference between two wireless networks belonging to distinct entities).

Simplicity to the end-user. We believe simplicity is the key to reusability. Users are not expected to install and set up complex systems (e.g., a database backend) in order to use WiPal.

Clean design. WiPal exhibits a modular design. Developers can easily adapt parts of the trace merger or integrate them into other systems (e.g., the reference frame identification process, the synchronization, or the merging algorithm).

These constraints call for an offline trace merger that does not require traces to be synchronized a priori. In practical terms, this means that sniffers only have to record their measurements on a local storage device, using the widely used pcap file format [7]. With regard to this format, WiPal supports all mainstream PHY headers: raw IEEE 802.11 frames, AVS, Prism, and Radiotap headers.² Some wireless packet traces use another link type though: IP packets encapsulated into pseudo-Ethernet frames. It is important to note that such traces are not MAC traces (only IP packets are available) and thus do not contain enough information for accurate synchronization and merging. WiPal merges these traces when requested, but this is an experimental feature that has not been extensively tested. As seen in the previous chapter, adding new link types is straightforward: WiPal's design only requires implementing the right abstractions and modifying a couple of lines in the existing codebase.

One can access WiPal's merging services through its software library or using a set of binaries to manipulate wireless traces. All tools work directly on pcap files both as input and output. wipal-merge is the main command to merge an arbitrary number of traces:

    $ wipal-merge t1.pcap t2.pcap [t3.pcap...]
It is worth mentioning that intermediate steps of the merge procedure can be performed separately, such as:

    $ wipal-intersect-unique-frames t1.pcap t2.pcap
    $ wipal-synchronize t1.pcap t2.pcap sync_t1.out.pcap

3.3 Detailed operation of WiPal's trace merging

Figure 3.2 depicts WiPal's structure. Each box represents a distinct module and arrows show WiPal's data flow. WiPal takes two wireless traces as input and produces a single merged trace.³ In the following, we explain in detail how each of the modules works.

² See Chapter 2 for an explanation of PHY headers.
³ In order to merge more than two traces, it suffices to execute the merging tool as many times as required (two by two). The wipal-merge command does this automatically.

Figure 3.2: The structure of a merge process in WiPal.

Identifying reference frames

This section explains the process of extracting reference frames. This operation involves two steps: extraction of unique frames and intersection of unique frames (see Figure 3.2). Let us first define what a unique frame is. A frame is said to be unique when it appears in the air once and only once for the whole duration of the measurement. A frame that is unique within each trace but that actually appeared twice on the wireless medium should not be considered unique. The process of extracting unique frames finds candidates to become reference frames. The process of intersecting unique frames then identifies identical unique frames from both traces, which become reference frames.

Extraction of unique frames

WiPal considers every beacon frame and non-retransmitted probe response as a unique frame. These are management frames that access points send on a regular basis (e.g., every 100 ms for beacon frames). The uniqueness of these frames is due to the 64-bit timestamps they embed (these timestamps are not related to the actual timestamps used for synchronization, as we will see later).

In practice, the extraction process does not load full frames into memory. It uses 16-byte hashes instead, which are stored in memory and used for comparisons. Limiting the size of stored information is an important aspect since, as we will see later, WiPal's intersection process performs a lot of comparisons and needs to store many unique frames in memory. Tests with CRAWDAD's uw/sigcomm2004 dataset [50] have shown that this technique is practical. For instance, WiPal needs less than 600 MB to load 7,700,000 unique frames.

There are some rare cases where the assumption that beacons and probe responses are unique does not hold. The uw/sigcomm2004 dataset has a total of 50,375,921 unique frames (about 14% of the total 364,081,644 frames). Among those frames, we detected 5 collisions (distinct unique frames sharing identical hashes). WiPal's intersection process includes a filtering mechanism to detect and filter such collisions out.

Intersection

The intersection process intersects the sets of unique frames from both input traces. There are multiple algorithms to perform such a task. Based on Cheng et al. [22], a solution is to bootstrap the system by finding the first unique frame common to both traces and then use this reference frame as a basis for the synchronization mechanism, as shown in Algorithm 1 (we call this algorithm streaming intersection). One may also use subsequent reference frames to update synchronization.
This algorithm is practical because the inner loop only searches a very limited subset of $I_2$. It has several drawbacks though: (i) the performance of the algorithm strongly depends on the precision of the synchronization process; (ii) finding the first reference frame is an issue when no other synchronization mechanisms are available; (iii) this algorithm couples intersection with synchronization, which is undesirable with respect to modularity; and (iv) there is a possibility that some frames are read multiple times from $I_2$. More specifically, access to $I_2$ is not sequential.

We propose the retained intersection algorithm, which is much simpler to implement and avoids the drawbacks of the abovementioned solution (see Algorithm 2). Its main characteristics are: (i) it does not require a bootstrapping phase; (ii) it does not depend on any kind of synchronization; and (iii) it sequentially reads each frame only once from $I_1$ and $I_2$. Our algorithm starts by loading all unique frames of the first trace into memory. This precludes using it as an online tool. Note that loading all unique frames from a trace into memory may also hog resources; this justifies the importance of having small identifiers for

the unique frames. These constraints are however irrelevant in practice. To support our argument, let us show an example using the uw/sigcomm2004 dataset. The biggest traces are those from sniffers mojave and sonoran on channel 11 (roughly 19 GB each). Extracting these traces' unique frames and intersecting them using WiPal needs 575 MB of memory. Therefore, memory consumption is not a concern in our algorithm.

Algorithm 1 Streaming intersection (uses synchronization).
Input: two lists of unique frames $I_1$ and $I_2$.
Output: a list of reference frames.
    δ ← synchronization precision
    for all u_1 ∈ I_1 do
        t_u1 ← u_1's time of arrival
        for all u_2 ∈ I_2 between t_u1 − δ and t_u1 + δ do
            if u_2 is an occurrence of u_1 then
                Append (u_1, u_2) to output.
            end if
        end for
    end for

Algorithm 2 WiPal's retained intersection.
Input: two lists of unique frames $I_1$ and $I_2$.
Output: a list of reference frames.
    h ← ∅    (implement h with a hash table)
    for all u_1 ∈ I_1 do
        Insert u_1 into h.
    end for
    for all u_2 ∈ I_2 do
        if h contains an element u_1 equal to u_2 then
            Append (u_1, u_2) to output.
        end if
    end for

Another advantage of the proposed algorithm is its ability to detect collisions of unique frames within the first trace. As indicated in Algorithm 2, this algorithm uses a set h (in practice, implemented using a hash table) that contains unique frames from the first trace. One detects collisions when trying to insert into h an element that is already part of it. When WiPal encounters such cases, it memorizes the collisions and filters them out of the hash table before starting the algorithm's second loop. Of course, collisions in the second trace remain undetected. Even if WiPal detected them, there would still be the possibility that a collision spans both traces (i.e., each trace contains one occurrence of a colliding unique frame). Such cases lead to invalid reference frames.
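The extraction and retained intersection steps described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not WiPal's C++ implementation: frames are assumed to be raw IEEE 802.11 frames without PHY headers, the 16-byte hash is computed with BLAKE2b (the hash function WiPal actually uses is not specified here), and all function names are hypothetical. Colliding entries are dropped as soon as they are seen, which yields the same table as filtering them out before the second loop.

```python
import hashlib

TYPE_MANAGEMENT = 0
SUBTYPE_PROBE_RESPONSE = 5
SUBTYPE_BEACON = 8
RETRY_FLAG = 0x08  # "Retry" bit in the second frame-control byte


def is_unique_candidate(frame):
    """True for beacons and for non-retransmitted probe responses."""
    if len(frame) < 2:
        return False
    frame_type = (frame[0] >> 2) & 0x3
    subtype = (frame[0] >> 4) & 0xF
    if frame_type != TYPE_MANAGEMENT:
        return False
    if subtype == SUBTYPE_BEACON:
        return True
    return subtype == SUBTYPE_PROBE_RESPONSE and not frame[1] & RETRY_FLAG


def extract_unique_frames(frames):
    """Yield (timestamp, 16-byte digest) for each unique-frame candidate.

    Assumption: hashing the whole frame with BLAKE2b; only the digest
    is kept, so full frames never accumulate in memory.
    """
    for timestamp, frame in frames:
        if is_unique_candidate(frame):
            yield timestamp, hashlib.blake2b(frame, digest_size=16).digest()


def retained_intersection(unique_1, unique_2):
    """Return reference frames as (t1, t2) pairs, in the order of trace two."""
    table = {}          # digest -> timestamp in trace one
    collisions = set()  # digests seen more than once in trace one
    for t1, digest in unique_1:
        if digest in table or digest in collisions:
            collisions.add(digest)
            table.pop(digest, None)  # filter the collision out
        else:
            table[digest] = t1
    references = []
    for t2, digest in unique_2:
        t1 = table.get(digest)
        if t1 is not None:
            references.append((t1, t2))
    return references
```

Both inputs are read sequentially and only once, and memory usage is bounded by the number of unique frames in the first trace.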
To detect invalid reference frames, WiPal looks at possible anomalies w.r.t. the interarrival times between unique frames. In

practice, invalid references are rare: only three occurrences when merging uw/sigcomm2004's channel 11 (a 73 GB input which produces a 22 GB output).

Synchronization

Synchronizing two traces means mapping trace one's timestamps to values compatible with trace two's. WiPal computes this mapping with an affine function $t_2 = a\,t_1 + b$. It estimates $a$ and $b$ with the help of reference frames as the process runs. Several techniques exist to perform these estimations: linear interpolations [44], linear regressions [31;59], or solving linear problems [52]. To combine generality with speed, WiPal uses a simple generalization of the techniques from Mahajan et al. [44] and Yeo et al. [59]. Note that other techniques could also be implemented without requiring modifications to WiPal's other components.

WiPal's synchronization process operates on windows of $w + 1$ reference frames (finding an optimal value of $w$ is discussed below). For each reference frame $R_i$, the process performs a linear regression using reference frames $R_{i-\lfloor w/2 \rfloor}, \ldots, R_{i+\lceil w/2 \rceil}$. At the beginning and at the end of the trace, we use $R_1, \ldots, R_w$ and $R_{N-w}, \ldots, R_N$ ($N$ is the number of reference frames). The result gives $a$ and $b$ for all frames between $R_i$ and $R_{i+1}$.

Experiments led us to choose 3 as the optimal value for $w$ (i.e., WiPal performs linear regressions on windows of 4 reference frames). Figure 3.3 shows the results of performing eight merge operations (on sixteen traces from four distinct datasets) with varying window sizes. The merges concern 12 hour-long excerpts of various traces.

Table 3.1: Characteristics of the traces used for testing merge operations. Id. relates to the identification number of the merge operations.

Dataset          Environment        Hardware   Id.   Chan.
uw/sigcomm2004   Conference         Laptop     /8
Private 1        Office building    Soekris    3     6
                                               4     6
Private 2        Uptown apartment   Netbook    5     11
                                               6     6
Private 3        Office building    Netbook    7     6
                                               8     1
One of the four datasets is uw/sigcomm2004 while the three others are private datasets we collected. Table 3.1 presents some characteristics of the traces we used for each merge operation. It is important to note that these sixteen traces were collected with various hardware in several environments, on different channels.

We define the synchronization difference between two traces as follows. First, consider only the subset $S$ of frames that are shared by both $T_1$ and $T_2$. For a given frame $f$, let $t_{f,1}$ be the arrival time of $f$ inside $T_1$ (after clock synchronization) and $t_{f,2}$ be the

arrival time of $f$ inside $T_2$. The synchronization difference is given by

$$\frac{1}{|S|} \sum_{f \in S} \left| t_{f,2} - t_{f,1} \right|.$$

One can summarize the synchronization difference as the average difference of synchronization between frames that are identified as shared among input traces.

Figure 3.3: Synchronization difference (µs) w.r.t. linear regression window size ($w + 1$). The upper curve (merges 1 and 3 to 8) represents average, minimum, and maximum values for seven of the eight merges. The lower curve (merge 2) represents the result for the other one, and is plotted separately because it has a singular shape. We think this is related to the timestamping accuracy of the input traces for this merge.

With the exception of merge 2, which exhibits a very singular behavior, Figure 3.3 shows that $w = 2$ leads to the minimum synchronization difference. Note that techniques that use $w = 1$ (i.e., that perform linear interpolations on pairs of reference frames) lead to the worst synchronization difference on average. However, choosing a $w$ that is too low or too high might lead to missing some shared frames. Figure 3.4 shows the number of frames that are identified as duplicates in the input traces with respect to window size. Whereas using $3 \le w \le 7$ allows detecting the maximal number of shared frames, using other values leads to some missed duplicates. Note that $w = 1$ gives the worst results. This indicates that synchronizing traces using linear interpolation (as Wit [44] does) may lead to incorrect results. Therefore WiPal uses $w = 3$: among the values that detect the maximum number of shared frames, this is the one that leads to the minimum synchronization difference.
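The windowed regression and the synchronization difference defined above can be sketched as follows. This is a hedged illustration rather than WiPal's code: reference frames are assumed to be given as (t1, t2) timestamp pairs sorted by t1, the window is simply clamped at both ends of the trace so that it always holds w + 1 reference frames (our simplification), and the function names are hypothetical.

```python
import bisect


def fit_affine(points):
    """Least-squares fit of t2 = a * t1 + b over (t1, t2) pairs."""
    n = len(points)
    mean_1 = sum(t1 for t1, _ in points) / n
    mean_2 = sum(t2 for _, t2 in points) / n
    var = sum((t1 - mean_1) ** 2 for t1, _ in points)
    if var == 0:
        raise ValueError("need at least two distinct t1 values")
    cov = sum((t1 - mean_1) * (t2 - mean_2) for t1, t2 in points)
    a = cov / var
    return a, mean_2 - a * mean_1


def windowed_fits(references, w=3):
    """One (a, b) per reference frame R_i, fitted on w + 1 references.

    The window starts at i - floor(w/2); it is clamped to stay inside
    the list near both ends of the trace (an assumption of this sketch).
    """
    size = w + 1
    n = len(references)
    if n < size:
        raise ValueError("not enough reference frames")
    fits = []
    for i in range(n):
        start = min(max(i - w // 2, 0), n - size)
        fits.append(fit_affine(references[start:start + size]))
    return fits


def synchronize(t1, references, fits):
    """Map a trace-one timestamp with the fit of the last reference at or before it."""
    keys = [ref[0] for ref in references]
    i = max(bisect.bisect_right(keys, t1) - 1, 0)
    a, b = fits[i]
    return a * t1 + b


def synchronization_difference(shared):
    """Mean |t_f2 - t_f1| over shared frames, t_f1 being already synchronized."""
    diffs = [abs(t2 - t1) for t1, t2 in shared]
    return sum(diffs) / len(diffs)
```

With w = 1 the fit degenerates into a linear interpolation between two consecutive reference frames, which is the case the text above reports as giving the worst results.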

Figure 3.4: Number of frames detected as shared by both input traces (normalized) w.r.t. linear regression window size ($w + 1$). The curve represents the average, minimum, and maximum values for eight merge operations. For each merge operation, this number is normalized so that the window size that gives the highest value corresponds to 1.

Merging

We now present how WiPal performs the final step, namely the merging process itself. Its role is to copy frames from synchronized traces to the output trace. Of course, it must organize its output correctly while avoiding duplicate frames. Algorithm 3 details WiPal's merging algorithm. For the sake of illustration, we present here a simplified version that assumes that only one frame is emitted at a given time inside the monitoring area. It simultaneously iterates on both inputs, where each iteration adds the earliest input frame to the output (lines 15 and 16). Duplicate frames are the ones that have identical contents and that are less than 106 µs apart (line 11). The rationale for this value is that 106 µs is half the minimum gap between two valid frames [59]. Therefore, the appearance of identical frames during such an interval is in fact a unique occurrence of the same frame.

3.4 Evaluation

This section provides an evaluation of WiPal's merging algorithms using the datasets previously described. We investigate both the correctness and the efficiency of WiPal. We run the merges and then use some heuristics to evaluate the quality of the result. We also analyze WiPal's execution speed.

Correctness

Checking the correctness of merge outputs is difficult. Being able to test whether traces are correctly merged or not would be equivalent to knowing exactly in advance what the merge should look like. Unfortunately, there is no reference output against which we could

compare. Thus, we propose several heuristics to check if WiPal introduces inconsistencies in its outputs. We also check WiPal's correctness with a test suite of synthetic traces for which we know exactly what to expect as output.

Algorithm 3 WiPal's merging algorithm.
Input: two synchronized traces $T_1$ and $T_2$.
Output: the merge of $T_1$ and $T_2$.
 1: procedure ADVANCE(f: frame, T: trace)
 2:     Append f to output; f ← T's next frame (or nil)
 3: end procedure
 4: f_1 ← T_1's first frame; f_2 ← T_2's first frame
 5: while f_1 ≠ nil or f_2 ≠ nil do
 6:     if f_1 = nil then ADVANCE(f_2, T_2)
 7:     else if f_2 = nil then ADVANCE(f_1, T_1)
 8:     else
 9:         t_f1 ← f_1's time of arrival
10:         t_f2 ← f_2's time of arrival
11:         if f_1 = f_2 and |t_f1 − t_f2| < 106 µs then
12:             Append either f_1 or f_2 to output.
13:             f_1 ← T_1's next frame (or nil)
14:             f_2 ← T_2's next frame (or nil)
15:         else if t_f1 < t_f2 then ADVANCE(f_1, T_1)
16:         else ADVANCE(f_2, T_2)
17:         end if
18:     end if
19: end while

A broken merging process could lead to several inconsistencies in the output traces. Regarding our datasets, we investigate in particular two of those inconsistencies: duplicate unique frames and duplicate data frames.

Duplicate unique frames. As seen previously, every unique frame should only occur once in the traces (including merged traces). Yet, it is difficult to avoid collisions in practice (see Section 3.3.2). Thus one should not consider all collisions as inconsistencies. After merging, our traces have 6 collisions. After a manual check, five of them are not inconsistencies introduced by WiPal's merging process. The last one is due to a synchronization error of 1.5 milliseconds. When looking closer at the output trace, it appears that this error spans 4.7 milliseconds and duplicates at most 4 frames (a beacon frame and three identical retransmitted data frames).
We believe this is an excellent score, considering our inputs have 79,340,347 frames with various timestamping accuracies.
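Returning to Algorithm 3, its two-way merge loop can be sketched in Python as follows. The sketch makes the same simplifying assumption as the algorithm (only one frame is emitted at a given time) and further assumes that each trace is an in-memory list of (timestamp in µs, frame bytes) pairs, already synchronized and sorted by timestamp. The name `merge_traces` is hypothetical.

```python
DUPLICATE_WINDOW_US = 106  # identical frames closer than this are one frame


def merge_traces(trace_1, trace_2, window=DUPLICATE_WINDOW_US):
    """Merge two synchronized traces, keeping one copy of shared frames."""
    merged = []
    i = j = 0
    while i < len(trace_1) or j < len(trace_2):
        if i == len(trace_1):        # trace one exhausted
            merged.append(trace_2[j])
            j += 1
        elif j == len(trace_2):      # trace two exhausted
            merged.append(trace_1[i])
            i += 1
        else:
            (t1, f1), (t2, f2) = trace_1[i], trace_2[j]
            if f1 == f2 and abs(t1 - t2) < window:
                merged.append(trace_1[i])  # either copy would do
                i += 1
                j += 1
            elif t1 < t2:
                merged.append(trace_1[i])
                i += 1
            else:
                merged.append(trace_2[j])
                j += 1
    return merged
```

Note how the duplicate test depends directly on synchronization quality: two copies of the same frame whose synchronized timestamps differ by 106 µs or more both end up in the output.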

Duplicate data frames. We search traces on a per-sender basis for successive duplicate data frames (only considering non-retransmitted frames). Such cases should not occur in theory: without retransmissions, sequence numbers should at least vary. Surprisingly, some input traces contain such anomalies. We have no explanation for why some datasets exhibit these phenomena. We checked however that the merged trace does not have more duplicates than the original traces (inputs have 1,689 duplicates while the output only has 1,149).

Efficiency

Trace merging is a run-once operation and WiPal is an offline process. Yet speed is an important metric to consider: it is always desirable to make a program run faster, as long as it does not answer instantaneously. In particular, as shown below, WiPal is able to perform in minutes what takes hours with other merging software. Less time spent merging means more time is available for other, more important processing (e.g., analyzing the dataset, which might also be a heavy operation). As another example, the merge operation might run on a multi-user system, with other users having some time constraints. Shorter delays between trace collection and trace analysis mean more interactivity and gains in productivity (e.g., if the collected traces have issues, it might be desirable to detect them quickly in order to fix the problem, possibly by collecting other traces).

Merging all the traces (17.5 GB) takes 35 minutes (real time as reported by the time UNIX command) on a 3 GHz processor with 2 GB of RAM. The average CPU usage is 93%. User time, which does not account for system delays and thus disk slowdowns, is 31 minutes and 32 seconds.

Comparing WiPal with online trace mergers does not make much sense: their mode of operation is different, and they also have different requirements (e.g., wired connectivity and loose synchronization). The comparison would be unfair.
We can however compare WiPal with Wit [44], another offline merger. Wit works on top of a database backend, which means that trace files need to be imported into a database before any further operation can begin (e.g., merging or inferring missing packets). Using the same machine as before, importing all input traces into Wit's database takes 8 hours and 20 minutes (user time). This means that, before Wit begins its merge operations, WiPal can perform at least 14 runs of a full merge with the same data. WiPal thus allows tremendous speed improvements. One of the reasons for such a difference is that WiPal uses high-performance C++ code while Wit is a set of Perl scripts using SQL to interact with a database.

Conclusion

This chapter introduces WiPal's trace merger. As an offline merger, it does not require sniffers to be synchronized nor to have access to a wired infrastructure. WiPal provides several improvements over existing equivalent software: (i) it comes as a simple program able to manipulate trace files directly, instead of requiring a more complex software setup; (ii) its synchronization algorithm offers better precision than the existing algorithms; and (iii) it has a clean modular design. Furthermore, we also showed that WiPal is an order of magnitude faster than Wit [44], the other available offline merger.

We have several plans for the future of WiPal's merging procedure. First, we are currently extending it to include new features. For instance, we are working with other contributors in order to merge other types of packet traces using WiPal's algorithms. We are also working with researchers from the University of California, Los Angeles on new synchronization algorithms. We would also like to make better use of WiPal's modularity and test other algorithms for the various stages of the merging operation.


Part II. Applying WiPal: empirical analyses


Chapter 4. Accuracy of wireless packet sniffing

Once one has tools for sniffing and merging, the question of trace completeness arises. With Wi-Fi sniffing, each sniffer trace is incomplete (i.e., it lacks some frames). Therefore, it is possible that the merged traces are incomplete as well. This chapter focuses on two aspects of trace completeness in IEEE 802.11 networks. First, we observe that existing techniques to evaluate trace completeness are inaccurate (see Section 4.3). Among other issues, a single buggy device may be enough to distort the whole evaluation. Second, we study how the number of sniffers impacts trace completeness (see Section 4.4). Using up to eight sniffers sharing (approximately) the same location, we show that even though individual sniffers may provide good accuracy, sometimes using eight sniffers is still not enough to capture all frames. Furthermore, the sniffing process exhibits a high level of randomness with variable accuracy.

To obtain these results we conduct two similar controlled experiments. In each experiment, one records a spot's Wi-Fi activity for a given duration using multiple sniffers. All sniffers have (approximately) the same location. It is then possible to analyze each sniffer trace, compare it to the others, and compute merge operations with a varying number of traces. Finally, studying each merge operation with respect to the number of traces that compose it provides comparative information.

This chapter is structured as follows. Section 4.1 presents the existing techniques to estimate trace completeness. Section 4.2 introduces our datasets and provides a preliminary analysis. Then Section 4.3 evaluates the completeness of our datasets and draws conclusions about existing evaluation techniques. Finally, Section 4.4 studies the impact of using multiple monitors on completeness.

4.1 Completeness evaluation: state of the art

When collecting IEEE 802.11 data using wireless sniffing, trace completeness is a key issue.
Even under good radio conditions, sniffers may miss a successful transmission. Since missed frames are unrecorded, it is impossible to know exactly how complete a trace is. Several methods exist, however, to estimate the efficiency of wireless sniffing as a technique. Other methods exist to estimate the completeness of single traces. Here is an overview of related work.

Yeo et al. [59] use active indoor measurements (in a single university building). They estimate that sniffer traces feature at least 73% of all the frames of their experiment. When merging traces from three monitors, they obtain a completeness of at least 99%. Using similar experiments in the same kind of environment, Cheng et al. [22] obtain a completeness of 95%. Serrano et al. [54] also perform active measurements, using an anechoic chamber. Their results show that single-sniffer accuracy varies significantly across sniffers, and that performance may also depend on the nature of the experiment under study and on slight changes in the sniffer position. In this best-case scenario, they obtain a completeness of about 96% for single sniffers, on average.

Based on the message sequences allowed by the IEEE 802.11 standard, one can infer some missing frames. For instance, since an acknowledgment frame must follow a successful data frame transmission, a trace containing an acknowledgment with no preceding data frame lacks a frame. Of course, other rules exist for other frame types. Using this technique on real traces from an IETF meeting, Jardosh et al. [38] estimate a completeness of at least 80% for individual sniffers (the dataset did not allow merging, so no data is available concerning the accuracy of merging).

Rodrig et al. [49] use another technique, based on frame sequence numbers, to estimate the completeness of their traces. This technique is simple: since most IEEE 802.11 frames contain a sequence number, they look at sequence gaps to estimate missing frames.
Using traces they record at the 2004 SIGCOMM conference, they evaluate an overall completeness of roughly 90%. Curiously, after merging the same dataset, Mahajan et al. [43] estimate an equal completeness of 90% for channel 1, but a lower completeness of 79% for channel 11.

Schulman et al. [53] also raise an interesting point: since the parameters that impact trace completeness may vary during measurements, one should not use it as an accurate indicator of trace quality. For instance, a sniffer might provide a very accurate recording during silent periods where only a few access points send beacons, but perform very badly when the network load grows. To solve this issue they propose using dedicated T-Fi plots [53]. While we agree with them, we believe that studying trace completeness is still interesting in some cases. It provides quick insight and is easy to understand. For instance, a trace with a low completeness raises issues, whatever the network load through time.

As a summary, existing techniques rely on the fact that network protocols define valid frame sequences. When a trace contains an invalid (incomplete) frame sequence, one finds a number of frames to insert so that the sequence becomes valid. This number counts as the number of missing frames. Regarding IEEE 802.11, two categories exist: (i) message-type techniques that rely on frame types (e.g., a management or data frame must precede an acknowledgement) [22;38;43] and (ii) seqnum-based techniques that rely on sequence numbers (e.g., if frame 42 occurs right after frame 39, then frames 40 and 41 are missing) [49;53].

Applications of these techniques show attractive results [22;38;43;49;53;59]. In academic environments (laboratories, campuses, conference venues), the literature shows that individual sniffers exhibit completeness values between 70% and 80%. By merging traces, it is possible to reach values above 90%. As we will see in the following, however, we could never achieve such values in our experiments.

4.2 Datasets

We study trace completeness using two datasets. They feature traces from multiple sniffers, each one equipped with three IEEE 802.11 radio interfaces (ASUS EeePC 700 and Netgear WG111v3, see Figure 4.1). Interfaces listen on channels 1, 6, and 11. Each radio is set up in monitor mode and records every frame it hears regardless of the network the frame comes from. We then merge the sniffers' traces (on a per-channel basis) using the WiPal software suite [25;26].

Figure 4.1: ASUS EeePC 700 with three Netgear WG111v3, as used for trace collection.

4.2.1 Overview

We measure both datasets in the same environment but at different times. They record wireless activity in the computer science laboratory building of Université Pierre et Marie Curie. The laboratory spans four floors of a twelve-floor building mostly occupied by private companies. It is located in Paris, well outside the university campus. We refer to the datasets as follows:

December 1 dataset. Eight sniffers. Traces last roughly one hour and were recorded on December 1st, 2008, starting around 3 p.m. All sniffers were located indoors on the same desktop.

December 19 dataset. Six sniffers. Traces last roughly two hours and were recorded on December 19th, 2008, starting around 11 a.m. In this dataset, due to other constraints, we split sniffers into three groups of two. All groups are located indoors in the same room, but each group is at a different spot in the room.

Table 4.1: Quantitative characteristics of the two datasets.

 | December 1 (eight sniffers) | December 19 (six sniffers)
Duration | 1h13 | 2h10
Size¹ | 3 GB / 578 MB | 2.8 GB / 833 MB
Data size² | 1 GB / 203 MB | 210 MB / 82 MB
Devices³ | |
ESSIDs | |
Access points | |
Ad hoc cells | |

¹ Sizes before/after one merges the dataset. Only includes IEEE 802.11 frames and their payloads.
² Data sizes before/after one merges the dataset. Only includes IEEE 802.11 data frames and their payloads.
³ Each distinct MAC address in a frame's sender field counts as a device.

4.2.2 Preliminary analysis

Table 4.1 presents some quantitative characteristics of the datasets. Despite not being very different in nature, the traces display some unexpected differences.

The December 19 dataset lasts twice as long as the December 1 dataset, but its merged dataset is only one and a half times bigger. This difference in activity is probably due to the fact that more people are active during the afternoon than around lunch time. Also, December 19 is close to Christmas, so we can expect some regular users to be on vacation at that time.

The December 19 dataset has far less data traffic than the December 1 dataset. This confirms the previous point. Average management traffic rates are of the same order of magnitude in both datasets,¹ but the December 1 dataset has a higher average data traffic rate (46 kb/s vs. 11 kb/s, all channels cumulated). This is why the December 19 dataset is not twice as big as the December 1 dataset. Again, this confirms that the December 1 dataset displays more user activity.

¹ When cumulating traffic from all channels on merged datasets, the December 1 dataset has an average rate of 83 kb/s for management frames, while the December 19 dataset has an average rate of 96 kb/s.

Figure 4.2: Number of MAC addresses each merged trace contains from its beginning to a given time, for (a) the December 1 dataset and (b) the December 19 dataset. Each panel shows one curve per channel (1, 6, and 11) plus the channel average. Contrary to Table 4.1, which only accounts for MAC addresses from frames' sender fields, all fields containing valid MAC addresses are used.

Also note that non-data traffic (management and control traffic) is unexpectedly high. Control traffic is negligible (less than 2% of all traffic), so this overhead is mostly management traffic. This is a sign that many networks share the medium, each network having its own management traffic.

The December 19 dataset, which lasts twice as long as the December 1 dataset, also has twice as many distinct devices. This also holds for ESSIDs and access points. This is surprising because one should expect to discover most of the devices at the beginning of traces and then to have a curve that increases slowly (especially for ESSIDs or access points). Figure 4.2 presents growth curves. They do appear to be non-linear, but sniffers discover a majority of devices long after the first few minutes and the growth curves are not that flat. The datasets probably do not last long enough for us to draw more conclusions about this. However, the fact that one is able to discover new networks after more than one hour is another sign that many distinct networks share the radio medium.

Although both datasets exhibit the same small number of ad hoc cells, cell IDs are different in each dataset. Two cells from one dataset, however, share a prefix with a cell from the other. We believe these cells relate to temporary or test networks (e.g., mesh test 1, mesh test 2, or meshtest).

As a summary, both datasets reflect the same environment under different usage conditions. The environment features a high number of networks, almost all of them being

infrastructure networks. Despite a crowded medium, the December 19 dataset displays noticeably less user activity than the December 1 dataset.

4.3 Completeness evaluation: shortcomings

Several issues render completeness evaluation techniques inaccurate, partly because of their strategies and partly because of some anomalies that occur in traces. In fact, existing estimation techniques assume strict conformance to the IEEE 802.11 standard for all devices; this is often not the case, as we will see later in this section. Analyses of our datasets reveal multiple shortcomings. Individual traces from both datasets exhibit unexpected completeness values between 10% and 15% (using a seqnum-based technique). Merging traces raises these values by barely more than 1%. This is far from the expected 90%! Starting with this result, we made several observations.

Estimation techniques assume the network is not congested. In a congested environment many frames fail to access the medium. This means that counting gaps in sequence numbers reveals transmission failures rather than sniffer losses. Note that the large number of access points in our traces supports the congestion hypothesis. It also suggests that the hidden terminal problem is likely to occur on a massive scale.

Seqnum-based techniques assume IEEE 802.11 implementations generate correct sequence numbers. This is wrong in practice because:

1. Some access points wrap their counters at 2,048 instead of 4,096 [53]. How this affects estimation techniques is implementation-dependent. Possible effects include ignoring some relevant gaps or detecting invalid gaps with large values.

2. Some access points set their sequence numbers to zero for all frames (we observed this behavior during other minor experiments).

3. Some access points manage multiple virtual access points simultaneously, and our datasets contain several such devices. In the ideal case, each virtual access point should maintain its own sequence counter (IEEE 802.11 [37, p. 66]). In practice this is not the case, which introduces invalid gaps (i.e., leads to underestimating completeness).

At the time of this writing, no automatic technique exists to detect such anomalies. In particular, we see no straightforward solution for the third anomaly (single counter for multiple virtual access points). Nevertheless, once one detects the faulty stations, it is in theory possible to work around these anomalies. As an example, we detected that a single device in one of our datasets was responsible for a 5% underestimation of completeness.
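The seqnum-based idea and the counter-wrapping anomaly discussed above can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the estimator used in the cited works: the function name `estimate_completeness`, the `(sender, sequence number)` input format, and the `modulus` parameter are all hypothetical choices made for the example.

```python
def estimate_completeness(frames, modulus=4096):
    """Estimate completeness from per-sender sequence-number gaps.

    frames:  iterable of (sender, seqnum) pairs, in capture order
             (hypothetical input format chosen for this sketch).
    modulus: value at which the estimator assumes counters wrap.
    Returns captured / (captured + estimated missing frames).
    """
    last_seen = {}   # last sequence number observed for each sender
    captured = 0
    missing = 0
    for sender, seq in frames:
        captured += 1
        if sender in last_seen:
            gap = (seq - last_seen[sender]) % modulus
            # gap == 1 means consecutive frames; gap == 0 means a repeated
            # sequence number (e.g., a retransmission). Larger gaps are
            # counted as gap - 1 missing frames.
            if gap > 1:
                missing += gap - 1
        last_seen[sender] = seq
    if captured == 0:
        return 0.0
    return captured / (captured + missing)


# Frame 42 right after frame 39: frames 40 and 41 are counted as missing.
print(estimate_completeness([("ap", 39), ("ap", 42)]))
# A counter that wraps at 2,048 while the estimator assumes 4,096.
print(estimate_completeness([("ap", 2047), ("ap", 0)]))
```

In the second call the device has lost nothing, yet the sketch infers 2,048 missing frames at the wrap point. Passing `modulus=2048` for that sender removes the spurious gap, which is one way a known faulty station could be worked around once it has been identified.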

Figure 4.3: Score of a single merge operation, $\mathrm{score}(m_k) = \min(o_1, o_2) / |m_N|$. $m_N$ is the last merge, i.e., the one that includes $N$ sniffers. Note that when $k > 2$, $m_{k-1}$ features frames from at least two distinct sniffer traces and thus it is expected that $o_2 > o_1$. Therefore, in most cases, $\mathrm{score}(m_k) = o_1 / |m_N|$.

Message-type techniques fail to detect series of missing frames. For instance, message-type techniques cannot detect a missing data frame if its corresponding acknowledgement is also missing. We call a clear gap two consecutive frames from the same station that exhibit a gap in the sequence number and that are not interleaved with any frame that mentions this station (either as transmitter or receiver). Clear gaps are symptoms of missing frames that message-type techniques would not detect. In our two datasets, 81% and 89% of the estimated missed frames are due to clear gaps. In the well-known sigcomm2004 traces [50], clear gaps represent 59% of the estimated missed frames. This means that message-type techniques fail to detect most of the missed frames.

As a conclusion, one should use completeness estimation techniques with care. Message-type techniques are likely inaccurate. Seqnum-based techniques might lead to good results, provided there is no congestion and all participating devices strictly conform to IEEE 802.11. In any case, uncertainty exists regarding the accuracy of these techniques.

4.4 Completeness and number of sniffers

We now go one step further and investigate the impact of the number of sniffers on the completeness of the dataset. To this end, we analyze subsets of our datasets with a varying number of monitors.

4.4.1 Methodology

The goal is to evaluate the quality of a merged dataset with respect to the number of sniffers that compose it. We combine individual traces in groups of increasing size $k$, where $k \in \{2, 3, \ldots, N\}$ is the number of traces inside the group. Recall that $N = 8$ for the eight-sniffer dataset and $N = 6$ for the six-sniffer dataset.
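To make the grouping concrete, the sketch below enumerates every group of size k among N traces and merges each group. It is a toy model under loud assumptions: each trace is represented as a set of frame identifiers and merging is plain set union, whereas an actual trace merger also has to synchronize clocks and match frames across sniffers. The names `merge` and `merged_groups` are hypothetical. The sketch also rebuilds each group from scratch instead of reusing smaller merges.

```python
from itertools import combinations


def merge(a, b):
    """Toy merge (assumption): union of two sets of frame identifiers."""
    return a | b


def merged_groups(traces, k):
    """Return {tuple of trace names: merged dataset} for every group of size k.

    traces: dict mapping a trace name to a frozenset of frame identifiers.
    """
    groups = {}
    for names in combinations(sorted(traces), k):
        merged = frozenset()
        for name in names:
            merged = merge(merged, traces[name])
        groups[names] = merged
    return groups


# Four toy traces, each missing some of the frames 1..4.
traces = {
    "a": frozenset({1, 2}),
    "b": frozenset({2, 3}),
    "c": frozenset({3, 4}),
    "d": frozenset({1, 4}),
}
for k in (2, 3, 4):
    groups = merged_groups(traces, k)
    sizes = sorted(len(m) for m in groups.values())
    print(k, len(groups), sizes)
```

Because set union is symmetric and associative, the order in which traces are combined inside a group does not matter in this model, so one merged dataset per unordered group is enough.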

Figure 4.4: Successive computations of $M_k$ for $N = 4$. An arrow from $x$ to $y$ symbolizes the $x \oplus y$ merge operation.

Let $M_k$ be the set of groups of size $k$ (i.e., merged datasets including traces from $k$ sniffers). To compute $M_k$, we proceed recursively from $M_{k-1}$. For the sake of simplicity, we define the binary merge operation $a \oplus b$, meaning the result of merging datasets $a$ and $b$. This operation is theoretically symmetric and associative, and we assume that our trace merging algorithms have these properties.

Let us show an example with $N = 4$ (see Figure 4.4). The original traces are $\{a, b, c, d\}$. We first compute $M_2$ by merging each trace with each other trace (due to symmetry, we skip some operations, e.g., $b \oplus a$ because we compute $a \oplus b$ instead). We compute $M_3$ by merging each element of $M_2$ with the remaining traces. Again, we can skip some operations due to symmetry and associativity (e.g., we skip $b \oplus c \oplus a$ because we compute $c \oplus a \oplus b$ instead). We keep on performing this procedure until $k = N$. Note that computing $M_k$ involves $\binom{N}{k}$ merge operations. Also note that each merge operation produces a new merged dataset. In the following, we therefore identify each merge operation with its output.

In order to evaluate the quality of each $M_k$, we attribute a score to each element of $M_k$. We then compute the average score of $M_k$. Let $m_k$ be the merged dataset one wants to score, $m_{k-1}$ the previously merged dataset, and $t$ the new individual trace we want to add to $m_{k-1}$. We have $m_k = t \oplus m_{k-1}$. Figure 4.3 depicts how we compute $\mathrm{score}(m_k)$. Basically, $\mathrm{score}(m_k)$ represents how many frames $m_k$ contains that would not have been taken into account if only $m_{k-1}$ or $t$ were considered. For better readability, we normalize this quantity with the

frame count of our largest merged dataset (so a score is a ratio between 0 and 0.5). The larger the score, the more useful the merge.

Figure 4.5: Evolution of scores w.r.t. the number of monitors. Each panel plots the score (%) against the number of monitors for channels 1, 6, and 11.

4.4.2 Results

Figure 4.5 and Figure 4.6 present the results. For both datasets, we merge each channel individually. Each cell presents average values for a given set of individual datasets. We draw the following conclusions.

Scores decrease with size. This is expected: the bigger the dataset, the less interesting it is to add new sniffers.

Scores never reach zero. This is however unexpected: even with eight sniffers, each trace contains a small percentage of frames that do not exist in the seven others.

Small merges are not that bad. Merges of size 2 are able to provide a significant portion of the datasets' total number of frames (78% and 73% on average for the two datasets). This indicates that a large part of the frames in individual traces are shared among sniffers. This is also visible when looking at the average proportion of shared frames inside $M_2$. One needs many sniffers, however, to obtain a near-complete trace: at least 5 sniffers for sizes above 90%.

Figure 4.6: Scores w.r.t. number of monitors and dataset. Each column represents a given channel of a specific dataset. Each row $M_k$ represents the set of sub-datasets of size $k$. Each cell contains a box whose size is proportional to the average number of packets inside the corresponding sub-datasets. Red (dark) parts of boxes represent average values of $o_1$ (see Figure 4.3). Pink (medium grey) parts represent average values of $o_2$. Numbers below boxes are average scores (in percent) with 95% confidence intervals.
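The score of Figure 4.3, whose averages are reported in Figure 4.6, can be sketched as follows. The sketch rests on assumptions that the text does not spell out: traces are modeled as sets of frame identifiers, $o_1$ is read as the number of frames found only in the new trace $t$, and $o_2$ as the number of frames found only in the previous merge $m_{k-1}$ (a reading consistent with the remark that $o_2 > o_1$ is expected when $k > 2$). The function and parameter names are hypothetical.

```python
def score(new_trace, previous_merge, largest_merge_size):
    """Score of merging `new_trace` into `previous_merge`.

    new_trace, previous_merge: sets of frame identifiers (toy model).
    largest_merge_size: frame count of the merge of all N traces.
    """
    o1 = len(new_trace - previous_merge)   # frames only the new trace has
    o2 = len(previous_merge - new_trace)   # frames only the previous merge has
    return min(o1, o2) / largest_merge_size


# Toy example: the new trace adds one frame (9) to a merge of six frames,
# and the merge of all traces is assumed to hold ten frames.
t = {1, 2, 3, 9}
m_prev = {1, 2, 3, 4, 5, 6}
print(score(t, m_prev, 10))
```

In this set model the two exclusive parts are disjoint subsets of the merged result, so their sizes sum to at most the size of the largest merge. The smaller of the two is therefore at most half of it, which is why a score lies between 0 and 0.5.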

Individual sniffers display high variability. This translates into wide confidence intervals on the first row of Figure 4.6. For instance, in one dataset, one sniffer accounts for 53% of all the dataset's frames while another accounts for up to 87% (results vary between 45% and 80% in the other dataset). Since some of these variations occur with sniffers next to each other, we conclude that sniffing processes exhibit high randomness.

As a summary, although most frames are heard by multiple sniffers, a few of them are difficult to receive. This means that each sniffer's traces contain most of the dataset's frames but also some frames that no other sniffer recorded. Therefore, researchers should use techniques that are robust to frame losses, as such losses are unavoidable no matter the number of sniffers.

4.5 Conclusion

Our analyses reveal that traditional completeness estimation techniques have several shortcomings, making them unreliable. Even when using eight sniffers on the same desktop, there exist frames recorded by only one sniffer. This suggests that some other frames were left unrecorded, even though we use more sniffers than typical settings do.

Several extensions to this work are possible. We plan to analyze underloaded environments with more monitors. One could also focus only on networks with good reception. Finally, it could be interesting to look for other completeness estimation techniques that differentiate between transmission failures and frame losses.


Chapter 5. Empirical analysis of Wi-Fi activity in three urban scenarios

The ability to study arbitrary environments is one of the motivations that led to developing WiPal. More specifically, we are interested in environments where no network traces are publicly available. This is why, in this chapter, we record and analyze traces from three environments with different sociological characteristics: an office, a dense uptown residential area, and a sparse suburban residential area. Contrary to existing studies, we do not focus on a single network, but on the overall network activity. We study the behavior of devices rather than traffic characteristics. We are interested in observations like the total duration a device is active, the frequency of appearance of new devices, and activity that can be extracted from traces. A sniffer usually faces radio range limitations and high packet losses; nevertheless, analyzing traces provides important insights into the activity of a given wireless environment as perceived by the wireless adapters. This is joint work with Mathias Boc [15]. We carried it out during our respective PhD theses.

Many papers use wireless sniffing as a monitoring technique. For instance, Cheng et al. propose Jigsaw [22], a large-scale monitoring system based on sniffing. However, despite being powerful and scalable, Jigsaw imposes some constraints on monitors that make it impractical in a number of environments. Researchers often use sniffing as a means to diagnose network problems [23], enhance security [12], or analyze communication protocols [43;59]. Nevertheless, as far as we know, authors using sniffing do not study user behaviors. In fact, some papers analyze the behavior of users with other techniques, but most of them focus on specific environments. They typically rely on traces collected from a given network's logs [13;35;42;56].
Their methods are therefore not applicable when several independent networks cover the target area, or when it is infeasible to access the network infrastructure. It is interesting to note, however, that some of these papers study large-scale networks. In particular, Afanasyev et al. [9] use such a technique on a city-wide network with several types of users (broadband access, 3G cellular, and commercial). Some papers rely on giving volunteers dedicated devices that measure contacts with other devices [36]. Typically, experiments concern a few dozen participants for a few days, which is the main limitation of this technique. To the best of our knowledge, only González et al. [33] study human mobility in a large environment, but they focus on real mobility rather than user behavior as seen from IEEE 802.11 networks.

5.1 Setup

We perform our analyses on traces collected in three different environments. We obtain each trace using a sniffer (laptop) equipped with three IEEE 802.11 radio interfaces (ASUS EeePC 700 and Netgear WG111v3, see Figure 4.1 in the previous chapter). The interfaces listen on channels 1, 6, and 11. Each radio is set up in monitor mode and records every frame it hears regardless of the network.¹ We refer to the three traces as follows:

Office. This is a three-day-long trace recorded in the computer science laboratory of Université Pierre et Marie Curie Paris 6. The laboratory spans three floors of a twelve-floor building that is also occupied by some private companies.

Residential, sparse. This is a three-day-long trace recorded in a suburban residential area. The area contains only small residential buildings and houses.

Residential, dense. This is a ten-day-long trace recorded uptown. The area is mostly residential but includes shops and schools. Residential buildings are tall towers. Car and pedestrian traffic is heavy.

¹ These traces are not available yet, but we plan to make them public as soon as possible.

Table 5.1 presents quantitative characteristics of these traces. As expected, the office trace has the greatest number of devices, ESSIDs,² and access points (AP). It has more access points than ESSIDs, which means that some wireless networks span multiple access points. The office trace also contains beacons from a relatively high number of ad hoc networks. This comes mostly from unconfigured devices (e.g., printers and "Free Public WiFi") and devices that create a network they expected to find (e.g., "AT&T Wireless"). For the same reasons, the dense residential trace includes information on ad hoc networks.

² ESSIDs are strings used as network identifiers. A single network might include multiple access points, but has only one ESSID.

The sparse residential trace, as expected, has the smallest number of devices. It has more access points than ESSIDs; this is due in part to hidden ESSIDs (5 APs hide their ESSID, and we expect them not to belong to the same network) and to Internet boxes that advertise shared ESSIDs belonging to network operators (e.g., for Wi-Fi phone service). This trace has, however, two surprising features:

1. Out of the 3.67 GB that compose the sparse residential trace, only 4.75 MB are data frames! 98.7% of frames in the sparse trace are access point beacons, which suggests that these networks exist but are unused in practice. We believe that they are provided by default with network operator boxes, but that most people access the Internet using wired links to their boxes.

2. The sparse residential trace is bigger and has more access points and ESSIDs than the dense residential trace. In fact, the sparse residential trace is bigger because its sniffer has more networks in its vicinity. This means that access points' frames account for most of a trace's size. It is however surprising that the sparse trace has more networks than the dense one. This might be due to differences in Wi-Fi signal propagation in each area (making it easier to hear distant networks in a sparse environment) or to social differences in the populations composing the neighborhoods.

Table 5.1: Quantitative characteristics of the Office, Residential sparse, and Residential dense traces.

 | Office | Residential, sparse | Residential, dense
Duration | 3 days 10h | 3 days 12h | 10 days 15h
Size¹ | GB | 3.67 GB | 1.61 GB
Data size² | GB | 4.75 MB | MB
Devices³ | | |
ESSIDs | | |
Access points | | |
Ad hoc cells | | |

¹ Sizes only include IEEE 802.11 frames and their payloads.
² Data sizes only include IEEE 802.11 data frames and their payloads.
³ Each distinct MAC address in a frame's sender field counts as a device.

5.2 Device diversity

This section investigates two sources of device diversity: cumulated activity durations and growth of the number of devices. The term device refers to any IEEE 802.11 station. This typically concerns human-operated computers, but also access points and Wi-Fi printers. The reason we study these two characteristics is twofold. First, we want to investigate who exactly uses the wireless medium at given periods and locations.
Second, device diversity is relatively easy to compute even in the presence of huge frame losses. This is important because we record each trace using wireless sniffing in areas that are unfriendly to this technique (due to interference and the presence of multiple walls). In this regard, the sparse residential trace is the worst: by looking at frame sequence numbers, we observe that the trace lacks 85% of the frames. We estimate that the office trace has a missing frame ratio of 70%, and the dense residential area trace a missing frame ratio of only 4%. This small value is due to the fact that the trace features a very small number of active networks, which means that the sniffer had very good reception of the predominant network. We first analyze the distribution of cumulated activity durations and then the growth of the number of devices.

Figure 5.1: Distributions of cumulated activity durations for (a) Office, (b) Residential, sparse, and (c) Residential, dense, with one plot per channel (1, 6, and 11). Horizontal axis: devices, sorted by total activity duration. Vertical axis: total activity duration (logarithmic scale, from 3 min up to 1 day or 1 week).

5.2.1 Cumulated activity durations

Figure 5.1 plots the distribution of cumulated activity durations among all traces and channels. Each impulse maps a single device to the total duration of its activity inside the trace. We consider that a device is active when it emits a frame within a window of three minutes (any type of frame: management, data, or control). We use the three-minute threshold because access point drivers use activity timers with similar values (e.g., MadWifi drivers use timers varying from 30 s to 5 min). Requiring one frame within a window of a few minutes makes the technique resilient to frame losses. A few features are common to all traces:

Devices are unevenly distributed among channels. In all traces, more devices appear on channel 11 than on channel 6, and more devices appear on channel 6 than on channel 1. This is a direct consequence of networks being unevenly distributed among channels (both in ad hoc and infrastructure modes).

Device activity has a highly uneven distribution for a given trace and channel (note the logarithmic scale on Figure 5.1). We can classify devices in three groups: (1) devices that are (almost) always active, (2) devices that appear only once, and (3) other devices. Among all traces, a total of 31 devices (out of 2,395³) belong to class (1); 27 of these devices seem to be access points. Two of the four remaining devices appear in the office trace, and two of them in the dense residential trace. The remaining devices emit no beacons, so they are not in ad hoc mode. It is interesting to underline that a handful of users always leave their devices on. A significant portion of devices belong to class (2) (20% in the office and dense residential traces, 9% in the sparse residential trace). This means that many users are not regular and just pass by. Class (3) is diverse and includes the whole range of possible duration values. However, the smaller the duration, the higher the probability.

Most devices are nearly inactive. Depending on the trace and channel, 48% to 96% (76% on average) of the devices are active for less than one hour during the whole duration of measurements. Therefore, a majority of devices are inactive most of the time.

There are points, however, where traces differ and show specific features. The office and sparse residential traces have similar shapes, but devices in the latter tend to accumulate longer activity durations. The dense residential trace has a shape that includes a visible cut between very active and nearly inactive devices. These variations are noticeable through the average durations: 2h36min for the office trace, 11h48min for the sparse residential trace, and 2h21min for the dense residential trace (keep in mind this trace is three times as long as the two others).
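The three-minute activity rule used for Figure 5.1 can be read in more than one way. The sketch below shows one possible reading, stated as an assumption: two successive frames from the same device that are at most three minutes apart belong to the same activity period, and a device's cumulated activity duration is the sum of those inter-frame intervals. Under this reading a device heard only once accumulates a duration of zero. Function names and the `(sender, timestamp)` input format are hypothetical.

```python
from collections import defaultdict


def cumulated_activity(timestamps, window=180.0):
    """Sum the intervals between successive frames at most `window` s apart.

    timestamps: capture times (in seconds) of the frames emitted by one device.
    """
    total = 0.0
    previous = None
    for ts in sorted(timestamps):
        if previous is not None and ts - previous <= window:
            total += ts - previous
        previous = ts
    return total


def activity_by_device(frames, window=180.0):
    """Map each sender to its cumulated activity duration.

    frames: iterable of (sender, timestamp) pairs (hypothetical format).
    """
    per_device = defaultdict(list)
    for sender, ts in frames:
        per_device[sender].append(ts)
    return {dev: cumulated_activity(ts, window) for dev, ts in per_device.items()}


# One device active in two bursts separated by a long silence,
# and one device heard a single time.
frames = [("x", 0), ("x", 60), ("x", 120), ("x", 1000), ("x", 1100), ("y", 500)]
print(activity_by_device(frames))
```

Since a single frame within the window is enough to extend an activity period, losing a few frames between two recorded ones leaves the computed duration unchanged, which matches the resilience to frame losses mentioned in the text.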
Therefore, in some environments, devices tend on average to be active longer.

³ A device that appears on multiple channels is counted multiple times.

5.2.2 Growth of the number of devices

Figure 5.2 plots the growth of the number of devices. Each curve corresponds to a given trace and channel (plus a curve for each trace that represents the average of the three channels). Each point shows the number of devices a given [trace, channel] pair features from its beginning up to the corresponding timestamp. We consider that each MAC address represents a device, and look for MAC addresses in all address fields of the frames. Some devices are mentioned as destinations but never as transmitters. That explains why we discover more devices than indicated in Table 5.1 and Figure 5.1. Furthermore, due to a subtlety in the IEEE 802.11 protocol, some address fields of the frames may contain values that are not real MAC addresses (e.g., independent BSSIDs). We ignore these values. We can derive a number of interesting observations from Figure 5.2:

Figure 5.2: Number of distinct MAC addresses each trace contains from its beginning to a given time, for (a) Office, (b) Residential, sparse, and (c) Residential, dense. Each panel shows one curve per channel (1, 6, and 11) plus the channel average, with time in days on the horizontal axis.

The distribution of devices among channels is uneven. Furthermore, it does not always correlate with the distribution of sending devices among channels. In all traces, fewer devices appear on channel 1 than on any other channel. This is perfectly consistent with previous results (cf. Section 5.2.1). Nevertheless, channel 6 attracts more users than channel 11 in two of the three traces. This contradicts the channel distribution of Figure 5.1. The difference is that Figure 5.1 only considers devices that emit frames while Figure 5.2 considers all types of devices. This indicates that it is difficult to evaluate the distribution of users among certain channels.

The discovery rate follows a day-night pattern. Curves periodically alternate between flat and growing periods. Depending on the trace, this effect has varying amplitudes and periods, but it is visible in all traces. Flat periods occur during nights, usually starting around midnight and stopping a few hours before noon. This shows that, as expected, device activity correlates with human activity.

In the office and dense residential area, the discovery rate is constant during long periods. Furthermore, in the dense residential trace, this still holds after a week of measurement. On the other hand, the sparse residential trace flattens drastically after two days. We believe that this is a consequence of the type of environment: high mobility is expected in uptown streets and offices, as well as a high turnover of people. We can expect that many new users will not come back before the measurement ends.
Therefore, this also explains why the average activity duration per user is higher in the sparse residential trace (see the end of Section 5.2.1). Note, however, that even when the discovery rate falls after two days, it is still possible to discover new users near the end of the trace.

Among the different observations derived in this section, we believe that two are of particular importance. First, as shown by the study of activity durations, users are mobile or do not generally keep their Wi-Fi equipment switched on. This translates into packet traces where most devices are inactive most of the time. Second, different environments have different impacts on mobility. This translates either into new users appearing evenly throughout a trace or, on the contrary, into new users being grouped at its beginning.

Figure 5.3: CCDFs of aggregated inter-activity times of all devices for the three traces. The distributions are well fitted by truncated power laws with exponential decays. The parameters of the distributions are presented in the text. Panels: (a) Office, (b) Residential, sparse, (c) Residential, dense. Each panel plots P[Inter-activity > t] against time t (seconds) and overlays the empirical distribution with the fit (t + t_0)^{-α} e^{-βt/k}.

5.3 Activity/Mobility Behaviors

This section analyzes the type of relationship devices develop with their environments. Behind the notion of relationship, we are interested in understanding how device activity evolves. We highlight predominant patterns when they exist, with the objective of characterizing the influence of locations on the behaviors of the devices. Because of our centered vision of space (we only use one sniffer at a time), it is difficult to extract physical mobility behaviors from traces. In some situations, however, temporal activity patterns give insight into device mobility: either a device is no longer in the considered space, or it is back and active. Statistical tools exist to extract information on mobility patterns. We take advantage of them and rely on the available activity information.

Inter-activity patterns

In this first part, we analyze the devices' rhythm of activity. For this purpose, we plot the aggregated complementary cumulative distribution function (CCDF) of the inter-activity times (see Figure 5.3). The inter-activity time is the time gap between the beginnings of two consecutive periods of activity. Therefore, the duration of activity is included in the inter-activity time. We start by presenting the distribution parameters and then, for each trace, we investigate the meaning of variations when they exist. Note that only devices that are active at least twice are represented here.

We can approximate the CCDFs of the three traces by truncated power laws with exponential decays:

P(t) = (t + t_0)^{-\alpha} \exp(-\beta t / k)    (5.1)

For the office trace, t_0 = 1 minute, α = 0.40, β = 1.2, and k = 24 hours (Figure 5.3(a)). The parameters for the residential sparse trace are t_0 = 1 minute, α = 0.45, β = 1.40, and k = 24 hours (Figure 5.3(b)). For the residential dense trace, t_0 = 15 minutes, α = 0.40, β = 0.8, and k = 24 hours (Figure 5.3(c)).

The power-law part of the distributions shows a slope that is very similar to recent experimental results found in the literature [19;41]. It accounts for a large proportion of the inter-activity times: 98.3% for the office trace, 99.2% for the residential sparse trace, and 92% for the residential dense trace. For the three distributions, k is almost the same, which may point to a cycle or period of one day (the characteristic time in [41]). The value of β is around 1.3 for the first two traces, which indicates a strong contraction of the probabilities of activity after 24 hours. For the residential dense trace, this value is lower (0.8), which here indicates a greater disparity of the probabilities. It is important to note that, partly due to the longer duration of this trace, there are no strong variations and thus no visible coordinated behaviors among the devices. Finally, the parameter values are very similar for the office and residential sparse distributions, which indicates that these locations might have the same level of influence on device behaviors (constraint, necessity, social habits).

Concerning the variations in the distributions, we observe three main steps in the distribution of the office trace: the first around 1 hour, the second around 24 hours, and the third around 48 hours.
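The definitions above (inter-activity time, aggregated CCDF, and the fitted form of Equation 5.1) can be made concrete with a short sketch. The function names and the input layout (a mapping from a device identifier to the start times of its activity periods, in seconds) are hypothetical. The thesis gives the functional form of the fit but no normalization constant, so the rescaling that makes the model equal 1 at t = 0 is an assumption made here for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def inter_activity_times(starts_by_device: Dict[str, Sequence[float]]) -> List[float]:
    """Gaps between the beginnings of consecutive activity periods.

    Gaps of all devices are pooled together (the aggregated view).
    """
    gaps = []
    for starts in starts_by_device.values():
        ordered = sorted(starts)
        # A device that is active only once yields no gap and is ignored.
        gaps.extend(b - a for a, b in zip(ordered, ordered[1:]))
    return gaps


def empirical_ccdf(samples: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (t, P[X > t]) for each distinct sample value t."""
    ordered = sorted(samples)
    n = len(ordered)
    points = []
    for i, t in enumerate(ordered):
        # Emit one point per distinct value, at its last occurrence.
        if i + 1 == n or ordered[i + 1] != t:
            points.append((t, (n - i - 1) / n))
    return points


def truncated_power_law(t: float, t0: float, alpha: float, beta: float, k: float) -> float:
    """Form of Equation (5.1), rescaled (assumption) so that the value at t = 0 is 1.

    All durations (t, t0, k) must use the same unit, e.g. seconds.
    """
    return ((t + t0) / t0) ** (-alpha) * math.exp(-beta * t / k)
```

With the office parameters quoted above, the model would be evaluated as `truncated_power_law(t, t0=60, alpha=0.40, beta=1.2, k=86400)` for t in seconds, and compared against `empirical_ccdf(inter_activity_times(...))` on logarithmic axes.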
Given the characteristic time k of 24 hours, we can suppose a periodicity of one day for a large part of the devices and one of 48 hours for a smaller part. The first variation, around 1 hour, is difficult to interpret because of its small length, but it may be linked to pauses in the devices' activity while they are present in the environment.

The residential sparse distribution presents four main steps: the first around 12 minutes, the second around 14 hours, the third around 24 hours, and the fourth around 34 hours. The first variation, around 12 minutes, is particularly interesting. After verification in the trace, this duration corresponds to a handheld mobile device that is programmed to check mail every 15 minutes (the observed 12 minutes plus the 3 minutes of activity duration granularity). Of the three other variations, the one around 24 hours concentrates the greatest proportion of probability. With a characteristic time of 24 hours, it points to a daily periodicity. Following the same logic, the two other variations point to a periodicity of 24 hours time-shifted by 14 hours, and to a periodicity of 34 hours that involves fewer devices than the 24-hour period.

Compared to the other traces, the residential dense distribution does not present clear wide variations. Although the characteristic time is also around 24 hours, there is no variation around this value. In this situation, it is difficult to judge whether there really are no coordinated behaviors among devices. To confirm the observations of periodicity, we analyze the device activity behaviors from a different point of view in the following.

Figure 5.4: Proportion of users that are active in each time interval, relative to the first time interval in which they appeared, for the three traces. In these traces, we observe a clear periodicity of 24 hours with some variations that are characteristic of the social meaning of each environment. Panels: (a) Office, (b) Residential, sparse, (c) Residential, dense. Each panel plots the proportion of active users against time t (hours).

Predominant activity pattern

With a long-term scope, we now investigate and extract, if it exists, the predominant pattern that defines each context, together with the properties that characterize it. There are different means to address this issue. Our approach is simply to slice the observation period into time intervals of equal duration and to aggregate the activity patterns of all devices in each of these intervals. More specifically, for each device we mark the set of time intervals in which it has been active, relative to the first time interval in which it appeared in the environment (which is set to 0). For each time interval, we then count the devices that were active to obtain the different proportions. With this method, the proportion of active devices at time interval x is the share of devices that were active x time intervals after they were first seen. Therefore, peaks at every multiple kx of some interval x (with integer k > 0) could point to coordinated activity and periodic behaviors. Here, we set time intervals of 1 hour and plot the results in Figure 5.4.
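The aggregation just described can be sketched in a few lines. This is an illustration under stated assumptions rather than the code used for Figure 5.4: the input layout (a mapping from a device identifier to the timestamps, in seconds, at which the device was seen active) and the function name are hypothetical, and the proportions are taken relative to the number of devices observed at all.

```python
from collections import Counter
from typing import Dict, Iterable, List


def relative_activity_profile(active_times: Dict[str, Iterable[float]],
                              interval: float = 3600.0) -> List[float]:
    """Proportion of devices active x intervals after their first appearance.

    Index x of the result corresponds to x intervals after the interval in
    which a device was first seen (offset 0).
    """
    counts = Counter()
    for times in active_times.values():
        # Set of intervals in which this device was active at least once.
        slots = {int(t // interval) for t in times}
        if not slots:
            continue
        first = min(slots)
        # Each device contributes at most once per relative interval.
        counts.update(slot - first for slot in slots)
    total = counts[0]  # every observed device is active at offset 0
    if total == 0:
        return []
    return [counts[x] / total for x in range(max(counts) + 1)]
```

With one-hour intervals, a population that returns every day would produce high values at indices 24, 48, 72, and so on, which is the kind of peak structure discussed for the office trace.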
We start by analyzing the results obtained for the office trace (Figure 5.4(a)). The figure presents clear peaks every 24 hours, which indicates a daily periodicity in the behaviors. The decrease in the proportions is due to new devices that have not been active during the whole measurement period. Their activity is therefore mostly visible in the first part of the figure. The second observation is that, around the peaks, the proportions remain high for about 8 hours and then decrease abruptly. Hence, there is a real coordinated movement of a large proportion of the observed population in this context. Office constraints and schedules can explain this phenomenon, and this predominant behavior can be considered representative of this type of environment, since a large part of the population consists of workers.

Contrary to the office trace, the residential sparse trace presents a different pattern with interesting properties (Figure 5.4(b)). If we start with a periodicity of 24 hours, the pattern presents peaks every 24 hours, which confirms this (expected) periodicity. However, we also observe another period of 24 hours, time-shifted by 10 hours from the start. To summarize, devices are globally active 10 hours after their first activity and again 14 hours later, with a periodicity of 24 hours. This phenomenon may have several explanations. The one most consistent with a residential environment is diurnal activity from which an office-like pattern is subtracted: devices are active early in the morning and early in the evening. In this situation, the gap of 10 hours corresponds to night periods when devices are not active, and the gap of 14 hours to periods when devices are away from home. Therefore, there is a real complementary link between the two environments. Compared with what we know from the networking literature, where most mobility/activity behaviors come from university campuses, the activity pattern we observe here is clearly new and different.

However, while we are able to extract a predominant pattern from this residential (sparse) environment, the residential (dense) context yields a different pattern. As mentioned in Section 5.1, the residential dense trace has been obtained uptown while the residential sparse one is suburban.
In a suburban residential environment, the proportion of observed devices that may have a relationship with the environment is large because of the high proportion of homes. Uptown, the presence of shops, schools, and other concentration points may introduce a proportion of devices that do not have any relationship with the considered location. With these elements in mind, we observe that it is difficult to extract a predominant pattern from the results of the residential dense trace (Figure 5.4(c)). If, as for the residential sparse trace, we assume a priori a periodicity of 24 hours, there are peaks that confirm this periodicity, but they have the same proportion as other peaks that occur irregularly. In this situation, a classification of the devices could help to better understand the different relationships that exist in this environment. Although we leave this study as future work, it should make it possible to detect and analyze a so far unseen population category, householders, alongside more traditional ones such as workers, commuters, and visitors.

5.4 Conclusion

This chapter analyzes behaviors of Wi-Fi users in three different locations that have distinct social meanings. With our sniffing technique, we are able to provide a more complete view of the population moving in a given location and to highlight important aspects of what can be found in real situations. In particular, we notice that: (i) in popular places, the number of discovered users can increase almost linearly within the window of observation, (ii) regular users account for a very small portion of the total population, (iii) user activity varies greatly from scenario to scenario, and (iv) the location plays a role in the presence duration. Related to these aspects, our study also raises open issues, such as how to distinguish the population for which the considered location has a social meaning, and how a device can understand what kind of environment it is currently in.

In order to extend this study, we are currently analyzing traces that we collected with multiple monitors in a Parisian park, the Parc Monceau. This park's Wi-Fi activity interests us because it includes several access points spread over various locations. We used ten monitors and measured an area about half as wide as the park for one hour (see Figure 5.5). Our analyses are in progress, so we only have a few results for the moment. Traces include 138 emitting devices, 71 of which are Apple devices. We believe these are mostly mobile devices (iPhone or iPod touch). With such a number of mobile devices, it is possible that the traces reveal unseen usage patterns.

Figure 5.5: Sniffer locations for the collection of traces inside the Parc Monceau. The subsequent trace analysis is currently in progress. (Background from Google Maps.)


Chapter 6 Conclusion and future work

Wireless sniffing is a powerful technique to measure activity in Wi-Fi networks, but it suffers from a number of issues. These are both pragmatic and theoretical. First, existing software to handle IEEE 802.11 packet traces is unsatisfactory. In general, available software has not been designed for reusability. Thus, developing new tools requires starting from scratch. There is also a lack of efficient and flexible merging tools. Second, several issues exist regarding the relevance of wireless packet traces. Wi-Fi sniffers inherently miss some frames, and it is therefore essential to evaluate the number of missed frames (i.e., the completeness of traces). Moreover, most studies involving wireless sniffing do not focus on Wi-Fi usage patterns, and other studies use it only in specific environments such as laboratories, campuses, or conferences.

This thesis addresses the aforementioned issues. We first develop WiPal, a framework to help process IEEE 802.11 packet traces. WiPal includes a flexible trace merger. Through the analysis of two short-lived traces, we then study the accuracy of completeness evaluation techniques and the impact of adding new sniffers on trace completeness. A final study collects and exploits three long-lived datasets in different environments to study Wi-Fi usage patterns.

6.1 WiPal

WiPal's design includes several software patterns that are relevant to packet trace processing. Since packet traces are basically streams of packets, using a pipe and filter pattern gives users a modular approach to trace processing. This allows for easy parametrization and maintenance of existing algorithms. Many algorithms also need to access specific fields of IEEE 802.11 frames and thus need to embed a parser. WiPal provides a solution that uses static callback functions to combine performance and reusability. WiPal also includes original features that cannot be found in other tools.
Among them are random access to a packet trace and trace aggregation. Evaluation shows that most of WiPal's features have marginal costs on its performance, and thus WiPal does not trade performance for reusability. Some of WiPal's utilities run faster than other state-of-the-art programs. WiPal features a library and tools to carry out miscellaneous operations (such as comparison, concatenation, or hexadecimal dumping), statistics extraction, and anonymization.

WiPal also includes an innovative offline trace merger. This merger includes original algorithms for reference frame extraction and trace synchronization. A study shows that its synchronization algorithm offers better performance and better accuracy than previous algorithms. WiPal's merger supports more input formats than any other Wi-Fi packet trace merger. Contrary to other tools, using it is straightforward and does not require setting up database backends or time servers. A performance evaluation also shows that it is an order of magnitude faster than Wit, the other offline trace merger.

6.2 Wi-Fi sniffing accuracy

In order to gain further insight into the completeness one can expect from Wi-Fi sniffing, we collected two short-lived datasets involving six and eight sniffers. Possibly due to congestion, these datasets exhibit a lower completeness than expected. A careful analysis reveals, however, that the existing evaluation techniques suffer from a number of issues. First, techniques based on analyzing message types are not accurate. Second, some Wi-Fi devices do not conform to the IEEE 802.11 standard and might skew the results of techniques based on analyzing sequence numbers. Finally, none of the existing techniques is accurate when the network is congested.

We then go further in our analyses and study the impact of the number of sniffers on trace accuracy. To this end, we vary the number of sniffers we use from a given dataset (starting with a single trace and then adding traces one after another until we use all the traces from the dataset).
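The incremental experiment described above can be illustrated with a small sketch. It assumes that each trace has already been reduced to a set of frame identifiers that are comparable across sniffers (how such identifiers are obtained is outside the scope of this sketch), and it measures coverage relative to the union of all traces, not relative to the frames actually transmitted, which a passive measurement cannot know. The function name is hypothetical.

```python
from typing import Hashable, List, Sequence, Set


def incremental_coverage(traces: Sequence[Set[Hashable]]) -> List[float]:
    """Share of the merged dataset covered after adding each trace in turn.

    traces[i] is the set of frame identifiers captured by sniffer i.
    The result has one value per trace, in the order the traces are added.
    """
    everything = set().union(*traces)
    if not everything:
        return []
    covered: Set[Hashable] = set()
    shares = []
    for trace in traces:
        covered |= trace
        shares.append(len(covered) / len(everything))
    return shares
```

Since the result depends on the order in which traces are added, one would typically repeat the computation over several orderings before drawing conclusions about the benefit of an additional sniffer.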
We find that, although most frames are heard by multiple sniffers, a few of them are difficult to receive. In other words, each sniffer's trace contains a large share of the dataset's frames (between 45% and 87% in our traces) but also some frames that are unique to it. This is true even when using eight sniffers sharing the same location. We argue that researchers should use analysis techniques that are robust to frame losses.

6.3 Wi-Fi activity

In a last study, we deploy Wi-Fi sniffers in three distinct environments with different sociological meanings: an office space, a sparse suburban residential area, and a dense uptown residential area. Inside these environments, we focus on Wi-Fi usage patterns, and we consider the whole traffic rather than a single network. The traces we collect last three days, except for the dense residential trace, which lasts ten days.

The three traces exhibit a number of differences. Among the residential traces, the biggest one carries almost no data traffic but includes more than ten distinct access points. The other residential trace, on the other hand, only mentions four access points but has a significant part of its traffic dedicated to data. This reveals that access point frames (and most notably management frames) account for most of a trace's size. Also, some environments include networks that are configured but not used. Another interesting feature is device discovery. While the office and the dense residential traces display an ever-increasing number of discovered devices, the sparse residential trace flattens after two days. This reveals that some environments display users with higher mobility and a high turnover, but this behavior is not universal. Environments also complement each other. While the office space has one peak of activity per day, the residential environments display two peaks spaced by ten hours. This reflects how people use their Internet connection before and after going to work.

Finally, the traces share a number of features. In all environments, devices are unevenly distributed among channels. Channel 11 always includes more devices than channel 6, and channel 6 always includes more devices than channel 1. Datasets also feature day-night patterns (but this is rather expected, as device activity reflects human activity). Also, inside every trace, only a very small portion of devices appear regularly. Finally, all traces include ad hoc cells.

6.4 Perspectives

As WiPal is a framework to help develop new tools, several perspectives exist regarding its extension. A first natural step would be to implement new protocols and filters in order to obtain a tool with a more general purpose than IEEE 802.11 traces.
WiPal already includes a few features regarding Ethernet, IPv4, and IPv6, but these are not at the same level as its handling of IEEE 802.11. Merging is also available as an experimental feature for some IP packet traces. Such generalizations are good evidence that WiPal's design is not specific to Wi-Fi and suits any protocol. It would also be of interest to exploit the modular nature of WiPal and develop multiple implementations of some features. For instance, WiPal could include multiple synchronization algorithms or multiple anonymization functions. Because WiPal uses static C++ techniques, several of its features require writing cumbersome code. Research should also be carried out in this regard to make these features nicer to use.

Another issue with WiPal is that its large code base makes it difficult to check for correctness of operation. Although most algorithms WiPal implements are simple, the C++ techniques it uses and the number of interactions involved make the code difficult to check. As a consequence, many test cases were developed to this end. It would be interesting, however, to study whether WiPal could be formally verified. Ultimately, WiPal could become a generic framework for handling packet traces, including algorithms at every level: from packet trace input/output to trace algorithms, and including parsers for a number of protocols.

On the accuracy of wireless sniffing, our analysis raises a number of questions. First, we are not sure that congestion is the source of the poor completeness in our traces. Furthermore, it would be unexpected for the CSMA mechanism of IEEE 802.11 to generate such significant losses (see footnote 1). Further controlled experiments in this regard are desirable. Maybe these losses are due to our setting (sniffers close to each other, with less-than-average capability, or even the specific traffic characteristics of the environment). It would be interesting to develop more experiments with different settings (e.g., different network adaptors, focusing on a single channel, or reducing interference). This could also give more insight into how the number of sniffers impacts accuracy. In this regard, it may be worth investigating why sniffers exhibit such variable accuracy.

We also show that the existing completeness evaluation techniques have some weaknesses. Although some of them probably cannot be worked around (it might be impossible to distinguish between sniffer losses and transmission failures), it would be interesting to develop techniques that fix the others (e.g., by automatically detecting some non-conforming behaviors regarding sequence numbers). Active experiments could also be of interest to evaluate how inaccurate each evaluation technique is.

Our analysis of Wi-Fi activity in urban environments also raises questions. First, it is possible that wireless sniffing introduces a bias in our results. For instance, devices located far from sniffers are likely to be seen less often than nearby devices, and it is unclear how this impacts their calculated active duration (even though we tried to mitigate such a phenomenon using an activity period of three minutes).
At this stage, it is also unclear how the information we discovered could be of use to others (e.g., researchers, application designers, software engineers, or hardware vendors). The uneven distribution of devices among channels probably argues for commodity access points to include algorithms that dynamically switch among channels. The periodicity in device behaviors may be of some use to people designing opportunistic networking schemes. Another concern is the generality of our results. Since the tools now exist to perform wireless sniffing in any environment, it would be of interest to perform more experiments, both in similar urban environments and in others. In this regard, we collected some traces in a Wi-Fi enabled park.

Footnote 1: CSMA (Carrier Sense Multiple Access) is a type of MAC scheme used by IEEE 802.11.

Appendices


Appendix A Summary of the thesis (translated from French)

The IEEE 802.11 standard [37] defines base layers for wireless communications. It appeared about ten years ago under the Wi-Fi brand, and it is widely used today. Personal computers that communicate with the Internet over radio links use this protocol almost exclusively. Wi-Fi also plays a major role in many mobile devices: it is found in PDAs, phones, portable media players, and even in some cameras. As a consequence, Wi-Fi is part of the ubiquitous computing landscape [58]. Together with other protocols such as Bluetooth or GSM, it is used to create a seamless digital environment that is integrated into our daily life. For instance, Wi-Fi access points (hotspots) equip homes, hotels, conference rooms, and many other places. This is why it is essential to understand how implementations of the IEEE 802.11 standard behave in the field. This knowledge is necessary to develop new applications and new protocols, or to improve existing ones.

A.1 Context

IEEE 802.11 specifies a physical layer (PHY) and medium access rules (MAC, see footnote 1) for a wireless network. The PHY is in charge of encoding and decoding information in digital form (sequences of bits) to and from a radio signal. The MAC, on the other hand, coordinates transmissions so that each station can share the medium without interfering with the others. Although it is mainly an industry-driven standard, researchers have produced a large body of work about IEEE 802.11. This includes very specialized topics that focus, for instance, on the PHY [30;45], the MAC [46], or other features such as security [12;14]. But other, more general research topics involve this protocol: ad hoc and mesh networks [10;27], sensor networks [60], or ubiquitous computing [58]. A good understanding of Wi-Fi therefore benefits all these fields. Reaching this understanding requires theoretical analyses as well as experimental studies. This thesis focuses on the experimental aspect, and in particular on field measurements of wireless networks.

Footnote 1: MAC stands for Media Access Control.

Figure A.1: Wireless sniffing: passive monitors listen to radio activity within the measurement area.

A.1.1 Passive Wi-Fi measurements and sniffing

Every network measurement technique is either active or passive. Active measurements modify network traffic in order to evaluate some parameters. Classic active techniques consist, for instance, in saturating a link to evaluate its capacity, or in sending probes to evaluate round-trip delays. Conversely, passive measurements do not interfere with network traffic. This is the case, for instance, when one listens on a link to analyze its traffic. Passive techniques may however interfere with the infrastructure: they may require users to install specific software, or administrators to plug in dedicated listening equipment.

A classic passive technique to measure wireless networks is sniffing. It consists in spreading monitors within the measurement area so that they capture all the traffic they can hear (see Figure A.1). Monitors produce traces, which are sequences of MAC packets (frames). Sniffing is a fundamental step in a number of network operations, such as diagnosis [23;34], security studies [12;48], and the analysis of protocol behaviors [22;39;43;59].
Although this is not mandatory, it can also support localization systems [20;21;61]. Many different sniffing configurations exist: there may be one or several monitors, these may consist of commodity or specialized hardware, and they may operate in isolation or while connected to a wired infrastructure (among other parameters). In all cases, however, the measurement operation is passive, non-intrusive, and does not interfere with the normal operation of the network.

Wireless sniffing often uses a centralized procedure that merges the traces [22;43;59]. The primary objective is to obtain a global view of radio activity from several local measurements. By using monitors with overlapping coverage areas, it is also possible to compensate for the losses of some monitors with data from other monitors. But this merging is a difficult task: it requires a very precise synchronization of the traces (within a few microseconds) and must take into account the unreliable nature of the radio channel (frame losses are unavoidable).

A.1.2 Open questions

Sniffing nevertheless raises a number of open questions. In this thesis, we focus on the computing-related technical aspects (see footnote 2). We classify them into two categories: questions about the technique itself, and questions about the tools. This thesis addresses both, in an effort to collect new datasets and produce original analyses.

Questions about the technique relate to the relevance of the traces produced. One example is the accuracy of monitors. Even under good radio conditions, monitors may miss frames that were nevertheless transmitted successfully. In this context, a natural question arises: since the traces of each monitor are incomplete (that is, some frames have been lost), the merge of these traces is likely to be incomplete as well. What accuracy can one expect from one monitor? From several monitors?
What results can be drawn from incomplete traces? Another question concerns the relevance of available datasets. While Wi-Fi is nearly ubiquitous, most datasets made public by researchers concern university campuses, laboratories, or conference venues [2]. This is partly because common practice is to focus on environments that researchers can access easily, but also because existing measurement techniques only work in some scenarios. Most of these techniques focus on a single network, or require setting up a complete infrastructure, or are intrusive with respect to network equipment. In the street or in a private home, these techniques are therefore difficult to put into practice. Yet wireless sniffing has a very strong potential for measuring any type of environment: it is passive, it does not interfere with the infrastructure, and in some cases it does not require setting up an infrastructure. But this potential has remained untapped so far. As a consequence, researchers focus on studying anomalies and some protocol specificities [22;39;43]. We believe, on the contrary, that it is more interesting to use sniffing as a technique to study network usage in hard-to-access environments (for instance homes, streets, or parks).

Footnote 2: Some aspects are not directly related to computing. For instance, sniffing raises legal and ethical questions.

Questions about the tools relate to the handling of packet traces. In networking, many operations involve this type of trace: administrators use them for monitoring and debugging, researchers for measurements, simulation, or validation. Wireless monitors produce packet traces, which are in fact lists of MAC frames. Many tools exist to create and manipulate these traces, but most of them are very specific and use code that is difficult to generalize. For instance, tcpdump [8] is able to decode a large number of distinct protocols, but its processing code can only be used to display packets in a terminal. Wireshark [6] is more modular, but it remains visualization-oriented overall and therefore suffers from similar problems. Most programs that process network packets are well designed and appear efficient with respect to their objectives. But whenever a new piece of packet trace processing software has to be created, relying on existing code is not practical. Moreover, some tools suffer from performance problems (for instance, Scapy [5] is a very powerful tool for trace analysis, but it is not usable on big traces of 1 GB or more). All this makes producing custom analyses of packet traces tedious. It generally requires programming new tools from scratch.

For all these reasons, merging IEEE 802.11 traces is a problem as well.
A few tools for this purpose exist in the literature, but most rely on a wired infrastructure [22;28]. The others are too specific to the experiment for which they were designed [43;44]. Generalizing Wi-Fi sniffing to any environment requires tools that are both generic and independent of any wired infrastructure.

A.2 Contributions of this thesis

The contributions of this thesis are twofold. First, we develop a software toolbox, named WiPal, to help manipulate IEEE 802.11 packet traces. It includes a generic library for developing new tools, and several ready-to-use utilities that perform predefined operations on traces. In particular, WiPal features an innovative trace-merging tool. Second, we use these tools to produce two analyses. They use several datasets that we collected in different environments, including multi-day traces from suburban residential areas and from a city centre. The first analysis studies the accuracy of Wi-Fi sniffing. The second studies Wi-Fi usage in these different environments.

A.2.1 WiPal: manipulating IEEE 802.11 traces

WiPal is our software suite for manipulating packet traces. It is freely available for download. It is designed for performance and genericity, in the hope that others can use it to develop new software, rather than to support one specific program. Although it focuses on the IEEE 802.11 protocol, it provides several protocol-independent features. What makes WiPal interesting is its original design and the novelty of some of its features. In this thesis:

- We present generic design patterns for handling several kinds of packet traces: for example, a pipe-and-filters mechanism for trace processing, or static callbacks for generating parsers that are both generic and efficient.
- We show how some new features can benefit packet-trace processing programs, and how to implement them: for example, random access to a packet trace, or transparent aggregation of several files into a single packet stream.
- We raise a number of problems that a designer may encounter when writing packet-processing software. We present the existing techniques for dealing with them, and explain which ones we chose for WiPal, and why.
- We evaluate WiPal's performance and compare it with other packet-trace processing programs.
The results show that WiPal's generic design has no notable effect on its performance (in terms of execution speed): WiPal's speed is comparable to that of specialized code. Likewise, some of the new features have no impact on performance, while others, which are optional, cause a limited slowdown.

General overview of WiPal

WiPal consists of a library and a set of binaries (programs). The binaries provide a simple and quick interface to the high-level features, but these features are also available through the library. For example, to merge several traces, the following command is enough:

    $ wipal-merge t1.pcap t2.pcap [t3.pcap...]

High-level features include trace synchronization (with the wipal-synchronize program), merging (wipal-merge), statistics computation (wipal-stats), anonymization (wipal-anonymize), and a few minor operations such as comparison, concatenation, or hexadecimal display (wipal-cmp or wipal-cat, for example). The most important low-level features are pcap input/output, IEEE 802.11 frame decoding, and support for various related protocols.

It is important to note that the source code of wipal-merge is only a thin shell around library features. Currently, the binaries' source code averages 122 lines of C++ (all of WiPal, library included, amounts to roughly [number missing] lines of code). The smallest binary takes 44 lines of code, and the largest 267. This code is mostly glue needed by the generic programming techniques that WiPal uses. On the other hand, performing specific tasks with WiPal's frame decoder, or combining several processing steps into a single executable, requires users to write their own programs against the WiPal library. Listing A.1 shows a very simple example of a program that uses this library.

    #include <iostream>

    #include <wipal/pcap/stream.hh>
    #include <wipal/wifi/frame.hh>

    using namespace wpl;

    int main()
    {
      // Open the trace; iterating over it yields one frame at a time.
      pcap::file<> f ("file.pcap");

      // Print the IEEE 802.11 type name of every frame in the trace.
      for (pcap::file<>::iterator i = f.begin(); i != f.end(); ++i)
        std::cout << wifi::type::names[wifi::type_of(i->bytes())] << std::endl;
    }

Listing A.1 An example program that uses the WiPal library. This program prints the type of each IEEE 802.11 frame in file.pcap.

WiPal architecture

Figure A.2 shows a simplified diagram of WiPal's architecture.
The binaries (top) rely on the library, which itself uses other external libraries. The library consists of several modules, which we classify into three categories: base, protocols and formats, and filters.

FIGURE A.2 WiPal's architecture and modules.

Base. These modules provide simple, common features that do not really depend on WiPal's application domain: for example, exceptions for error handling, generic abstract classes, and helpers for static programming. By relying on external libraries (such as Boost [1] or GNU MP [3]), we try to keep this layer as thin as possible.

Protocols and formats. These modules are specific to WiPal's application domain and provide the foundations for high-level processing. The abstractions they provide include IEEE 802 addresses, pcap traces, and various protocol headers, including IEEE 802.11.

Filters. Fundamentally, a packet trace is just a stream of network packets. Most algorithms need nothing more than to read this stream linearly, one packet after another, from beginning to end. A pipe-and-filters architecture [17] is well suited to this mode of operation, so this is what WiPal uses. The pipes are implemented with iterators [32]. For example, an anonymization filter takes an iterator as input and provides an iterator as output. Some processing must be adapted to fit such an architecture. This is the case for merging IEEE 802.11 traces: the merge has to be decomposed into several elementary operations (one filter per operation), connected in a precise way. Figure A.4 shows how WiPal decomposes the merge operation. Every operation that accesses a trace non-linearly needs such an adaptation.

Merging IEEE 802.11 packet traces

One of WiPal's distinctive components is its merging tool. This tool works offline and merges IEEE 802.11 packet traces. Its main characteristics are performance, ease of use, and flexibility. Accordingly, its design makes no assumption about the traces that would require monitors to be connected to a wired infrastructure (for example, some tools require network synchronization [22]). The tool is also compatible with all common formats (raw IEEE 802.11, and Prism, Radiotap, and AVS headers). Finally, it can be used simply by invoking it directly on the traces (whereas other tools require more complicated architectures, usually involving several servers [22;28;43]).

FIGURE A.3 Merging two traces T1 and T2.
A. The traces are not synchronized and do not contain all frames.
B. Reference frames common to both traces are identified. This information makes it possible to synchronize the traces.
C. The timestamp of each frame is adjusted so as to synchronize T1 and T2.
D. The traces can be merged by comparing timestamps. Frames that appear twice (once in each trace) are counted only once.

This thesis motivates and describes the design choices of WiPal's merging tool:

- It proposes new algorithms for several steps of the merging process. In particular, the synchronization algorithm generalizes existing algorithms from the literature.
- It provides an analysis of the synchronization algorithm; we show that it is more precise than previous algorithms.
- It provides a performance study showing that WiPal's merging tool is an order of magnitude faster than Wit, the only other publicly available offline merging tool.
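The pipe-and-filters design described above, in which each filter consumes an iterator and exposes another one, can be illustrated with a small sketch. It does not use WiPal's actual classes: `Frame`, `TransformFilter`, `make_filter`, `collect`, `Anonymize`, and `ShiftClock` are hypothetical names chosen for illustration, and the XOR-based anonymizer is only a stand-in for a real anonymization scheme.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

// Hypothetical, minimal frame record (not WiPal's frame type).
struct Frame {
  std::uint64_t timestamp_us;  // capture time in microseconds
  std::uint64_t source;        // 48-bit MAC address stored in an integer
};

// A filter wraps an input range and exposes a new range whose elements are
// computed lazily, one frame at a time. Filters chain into a pipeline
// because the iterator produced by one is a valid input for the next.
template <typename InputIt, typename Fn>
class TransformFilter {
 public:
  class iterator {
   public:
    typedef std::input_iterator_tag iterator_category;
    typedef Frame value_type;
    typedef std::ptrdiff_t difference_type;
    typedef const Frame* pointer;
    typedef Frame reference;

    iterator(InputIt pos, const Fn* fn) : pos_(pos), fn_(fn) {}

    // Apply the filter's operation to the current upstream frame.
    Frame operator*() const {
      assert(fn_ != nullptr);
      return (*fn_)(*pos_);
    }
    iterator& operator++() { ++pos_; return *this; }
    bool operator==(const iterator& other) const { return pos_ == other.pos_; }
    bool operator!=(const iterator& other) const { return pos_ != other.pos_; }

   private:
    InputIt pos_;
    const Fn* fn_;  // owned by the enclosing filter, which must outlive us
  };

  TransformFilter(InputIt first, InputIt last, Fn fn)
      : first_(first), last_(last), fn_(fn) {}

  iterator begin() const { return iterator(first_, &fn_); }
  iterator end() const { return iterator(last_, &fn_); }

 private:
  InputIt first_;
  InputIt last_;
  Fn fn_;
};

template <typename InputIt, typename Fn>
TransformFilter<InputIt, Fn> make_filter(InputIt first, InputIt last, Fn fn) {
  return TransformFilter<InputIt, Fn>(first, last, fn);
}

// Drain a pipeline into memory; this is the only place frames are stored.
template <typename Range>
std::vector<Frame> collect(const Range& range) {
  std::vector<Frame> out;
  for (typename Range::iterator it = range.begin(); it != range.end(); ++it)
    out.push_back(*it);
  return out;
}

// Toy anonymizer (assumption): hides the source address with a XOR key.
struct Anonymize {
  std::uint64_t key;
  Frame operator()(Frame f) const { f.source ^= key; return f; }
};

// Shifts every timestamp by a fixed offset, as a synchronization step might.
struct ShiftClock {
  std::uint64_t offset_us;
  Frame operator()(Frame f) const { f.timestamp_us += offset_us; return f; }
};
```

In this sketch, a pipeline is built by passing the `begin()` and `end()` of one filter to `make_filter` for the next, and no frame is processed until the final range is traversed. The design choice it illustrates is that each stage only needs linear, one-pass access to its input.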

Our analyses rely on sixteen real traces from four datasets (uw/sigcomm2004 [50] from CRAWDAD, recorded during the SIGCOMM 2004 conference, and three private datasets recorded under different conditions). They allow us to calibrate various parameters, validate the merging tool, and demonstrate its efficiency.

How trace merging works

Merging Wi-Fi traces generally requires synchronizing them first. This step corrects the timestamp of each frame so that all traces use the same time reference. It then becomes possible to identify frames that are identical across traces, so that they appear only once in the output (Cheng et al. [22] call this step unification). Accurate synchronization (a precision of 106 µs or better is required) relies on extracting reference frames. These are frames that could be identified automatically, without relying on any synchronization, as present in all input traces. Analysing the timestamps of reference frames yields a clock model for each trace, which makes synchronization possible. Figure A.3 illustrates this process.

To identify reference frames, WiPal starts by isolating unique frames. A frame is unique when it appears on the radio channel once and only once during the whole measurement. A frame that appears only once in a trace but actually occurred twice during the measurement must not be considered unique. Unique frames are candidates for becoming reference frames: the reference frames are the unique frames shared by every trace.
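As a rough illustration of how reference frames can lead to a clock model, the sketch below works on two traces. Several points are assumptions made for the example and are not taken from WiPal: frames are identified by a hypothetical `content_id` digest; uniqueness, which the text defines with respect to the radio channel, is approximated by requiring that an identifier occur exactly once in each trace; and the clock model is a straight line (slope plus offset) fitted by ordinary least squares, whereas the text only says that a clock model is computed from the reference frames.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical frame record: a local timestamp and a digest of the content.
struct Frame {
  double timestamp_us;
  std::uint64_t content_id;
};

using Trace = std::vector<Frame>;
// Each pair holds (timestamp in trace 1, timestamp in trace 2).
using ReferencePairs = std::vector<std::pair<double, double>>;

// Candidates: identifiers seen exactly once in the trace, with their time.
std::map<std::uint64_t, double> unique_candidates(const Trace& trace) {
  std::map<std::uint64_t, std::size_t> occurrences;
  for (const Frame& f : trace) ++occurrences[f.content_id];

  std::map<std::uint64_t, double> unique;
  for (const Frame& f : trace)
    if (occurrences[f.content_id] == 1) unique[f.content_id] = f.timestamp_us;
  return unique;
}

// Intersection: candidates that are present in both traces.
ReferencePairs reference_pairs(const Trace& t1, const Trace& t2) {
  const std::map<std::uint64_t, double> u1 = unique_candidates(t1);
  const std::map<std::uint64_t, double> u2 = unique_candidates(t2);

  ReferencePairs refs;
  for (const auto& entry : u1) {
    const auto match = u2.find(entry.first);
    if (match != u2.end()) refs.push_back({entry.second, match->second});
  }
  return refs;
}

// Assumed linear clock model: t1 is approximated by slope * t2 + offset_us.
struct ClockModel {
  double slope;
  double offset_us;
  double to_t1(double t2_us) const { return slope * t2_us + offset_us; }
};

// Least-squares fit, centred on the means for numerical stability.
ClockModel fit_clock(const ReferencePairs& refs) {
  assert(refs.size() >= 2);
  const double n = static_cast<double>(refs.size());

  double mean1 = 0.0, mean2 = 0.0;
  for (const auto& r : refs) { mean1 += r.first; mean2 += r.second; }
  mean1 /= n;
  mean2 /= n;

  double sxx = 0.0, sxy = 0.0;
  for (const auto& r : refs) {
    const double dx = r.second - mean2;
    sxx += dx * dx;
    sxy += dx * (r.first - mean1);
  }
  assert(sxx > 0.0);  // reference frames must not all share one timestamp

  const double slope = sxy / sxx;
  return ClockModel{slope, mean1 - slope * mean2};
}
```

With such a model, every timestamp of the second trace can be mapped onto the first trace's clock through `to_t1` before frames are compared and duplicates are dropped.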
The step that computes references from unique frames is the intersection. Figure A.4 shows a diagram of the whole merge operation as WiPal performs it.

A.2.2 Applications of WiPal: empirical analyses

Using WiPal's tools, we can then analyse datasets that we collected by sniffing. This thesis presents two such analyses. The first focuses on the accuracy of Wi-Fi sniffing. The second studies Wi-Fi usage in sociologically different environments.

We obtain all our traces with monitors (netbooks) equipped with three radio interfaces (ASUS EeePC 700 with Netgear WG111v3 USB Wi-Fi adapters, see Figure A.5). The radios listen on channels 1, 6, and 11. Each radio is configured in monitor mode and records every frame it hears, regardless of the network.

FIGURE A.4 The architecture of WiPal's merging process.

Sniffing accuracy

First, we collect short datasets (one to two hours long) using up to eight monitors placed at the same location. Analysing these traces first reveals several flaws in existing techniques for evaluating the completeness of a Wi-Fi packet trace. We then analyse how the completeness of a dataset varies with the number of monitors that contribute traces to it.

Flaws in completeness evaluation techniques

All existing techniques for evaluating the completeness of a trace rely on the fact that a protocol, by nature, defines which frame sequences are valid. When a trace contains an invalid sequence, the sequence is most likely incomplete. The technique then finds the minimum number of frames to insert for the sequence to become valid, and assumes that this number is exactly the number of frames the monitor lost.

FIGURE A.5 An ASUS EeePC 700 with three Netgear WG111v3 USB Wi-Fi adapters, as used to collect our traces.

For IEEE 802.11 there are two categories of techniques: (i) message-oriented techniques, based on frame types (for example, a management or data frame necessarily precedes an acknowledgement) [22;38;43], and (ii) sequence-number-oriented (seqnum) techniques, based on sequence numbers (for example, if frame 42 follows frame 39, then frames 40 and 41 were lost). Yet several flaws make these techniques imprecise, partly because of how they operate and partly because anomalies exist in the traces. Indeed, these techniques assume that every Wi-Fi device conforms exactly to the IEEE 802.11 standard, which is unfortunately not always the case. Here are the flaws we identified.

- Existing techniques assume that the network is not congested. In a congested environment, many frames fail their medium access procedures. Gaps in sequence numbers then reveal transmission failures rather than monitor losses.
- Seqnum techniques assume devices that generate correct sequence numbers. This is false in practice:
  1. Some access points reset their sequence number counters at 2048 instead of 4096 [53]. Depending on the analysis technique, this can lead to over- or underestimating the number of missing frames.
  2. Some access points use zero for all their sequence numbers (we observed this in some of our experiments).

  3. Some access points actually manage several virtual access points. In theory, each virtual access point should maintain its own sequence number counter. In practice this is not always the case, which leads to overestimating the number of missing frames.
- Message techniques miss some burst losses. For example, they cannot detect the loss of a data frame if the corresponding acknowledgement was also lost. Our studies show that burst losses make up a significant proportion of the losses in each trace.

Impact of the number of monitors on completeness

After studying the techniques for estimating completeness, we analyse how the completeness of a dataset varies with the number of monitors that contribute traces to it. We identify several interesting results.

- As one might expect, the more monitors there are, the less worthwhile it is to add a new one. However, even with eight monitors at the same location, each monitor holds a small proportion of frames that no other monitor heard.
- With only two monitors, one obtains on average between 70% and 80% of the frames that eight monitors would have captured. In other words, most frames are shared between monitors. That said, at least five monitors are needed to exceed 90%.
- Individually, monitor accuracy varies widely. A single monitor captures between 45% and 90% of what eight monitors could have captured.

To summarize, most frames are received by several monitors, but a few are very hard to hear: among six or eight monitors, each monitor holds a small proportion of frames that only it captured.
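The comparison above, between what a subset of monitors captures and what all monitors capture together, can be expressed as a small computation on sets of frame identifiers. The sketch below assumes that the traces have already been merged, so that the same frame carries the same identifier on every monitor; `FrameSet`, `frames_heard_by`, `coverage`, and `exclusive_frames` are hypothetical names, and the union of all monitors is used as the reference because the true set of transmitted frames is unknown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// One set of frame identifiers per monitor (identifiers are hypothetical).
using FrameSet = std::set<std::uint64_t>;

// Union of the frames captured by the chosen monitors.
FrameSet frames_heard_by(const std::vector<FrameSet>& monitors,
                         const std::vector<std::size_t>& chosen) {
  FrameSet heard;
  for (std::size_t index : chosen) {
    assert(index < monitors.size());
    heard.insert(monitors[index].begin(), monitors[index].end());
  }
  return heard;
}

// Share of the frames captured by all monitors together that the chosen
// subset captures on its own. Returns 0 when no monitor heard anything.
double coverage(const std::vector<FrameSet>& monitors,
                const std::vector<std::size_t>& chosen) {
  std::vector<std::size_t> everyone(monitors.size());
  for (std::size_t i = 0; i < everyone.size(); ++i) everyone[i] = i;

  const FrameSet all = frames_heard_by(monitors, everyone);
  if (all.empty()) return 0.0;
  const FrameSet part = frames_heard_by(monitors, chosen);
  return static_cast<double>(part.size()) / static_cast<double>(all.size());
}

// Number of frames that only the given monitor heard.
std::size_t exclusive_frames(const std::vector<FrameSet>& monitors,
                             std::size_t index) {
  assert(index < monitors.size());
  std::size_t count = 0;
  for (std::uint64_t id : monitors[index]) {
    bool heard_elsewhere = false;
    for (std::size_t other = 0; other < monitors.size(); ++other) {
      if (other != index && monitors[other].count(id) > 0) {
        heard_elsewhere = true;
        break;
      }
    }
    if (!heard_elsewhere) ++count;
  }
  return count;
}
```

Averaging `coverage` over all subsets of a given size gives one value per number of monitors, and `exclusive_frames` gives the per-monitor count of frames that no other monitor heard.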
In conclusion, losses seem unavoidable to us, so it is important that researchers use analysis techniques that remain reliable even when frames are missing.

Wi-Fi usage in urban environments

In a second analysis, we collect and analyse long traces (three and ten days long) obtained in three environments: an office, a dense urban residential area, and a low-density suburban residential area. We study the behaviour of each device rather than traffic characteristics.

FIGURE A.6 Distribution of cumulative activity durations, for each trace and each station. Panels: (a) Office; (b) Residential, suburb; (c) Residential, city centre; each panel has one plot per channel (1, 6, and 11). Horizontal axis: devices (by activity duration). Vertical axis: total activity duration (ticks from 3 min to 1 day, and up to 1 week in panel (c)).

We are interested in observations such as the total activity duration of a device, the rate at which new devices appear, and the activity we can extract from the traces. In this summary, we present two examples of results obtained from these datasets.

Cumulative activity durations

Figure A.6 shows the distribution of cumulative activity durations in all traces and on all channels. Each impulse represents the total activity duration of one device in a given trace. We consider a device active when it has transmitted a frame within the last three minutes (any type of frame: management, data, or control). We use this three-minute window because access point drivers use timers with similar durations (for example, MadWifi's timers range from 30 seconds to 5 minutes). In addition, requiring only one frame every three minutes makes the technique robust to frame losses. A number of characteristics are common to all traces.

- Devices are not evenly distributed across channels.
  In all traces, devices appear more often on channel 11 than on channel 6, and more often on channel 6 than on channel 1. This is a direct consequence of networks not being evenly distributed across channels.
- The distribution of activity durations is not uniform for a given trace and channel (note that Figure A.6 uses a logarithmic scale). There are three classes of devices: (1) those that are (almost) always active, (2) those that appear only once, and (3) the others. Across all traces, 31 devices (out of a total of 2,395) belong to class (1). All but four of these devices appear to be access points. Two of the four remaining devices are in the office trace, and two in the city-centre trace. Since they do not emit beacons, they are not devices in ad hoc mode. It is therefore worth noting that a handful of users leave their devices switched on permanently. A significant share of devices belongs to class (2) (20% in the office and in the city centre, 9% in the suburb). This means that many users are not regulars and are merely passing through. Class (3) is varied and spans the whole range of possible values. Nevertheless, the shorter the duration, the higher the probability.
- Most devices are almost inactive. Between 48% and 96% of devices, depending on the trace and channel (76% on average), are active for less than one hour over the whole trace. A majority of devices is therefore inactive most of the time.

On some points, however, the traces differ. The office and suburb profiles are similar, but in the latter devices tend to accumulate longer activity durations. The city-centre profile shows a very sharp break between active devices and those that are almost inactive.
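The activity definition used in this analysis (a device is active when it has transmitted a frame within the last three minutes) can be turned into a cumulative duration per device. The sketch below is one possible reading of that definition and not the thesis's actual code: each transmission at time t makes the device active over [t, t + window), overlapping intervals are merged, and durations are not clipped at the end of the trace. `Emission` and `cumulative_activity` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical record: which device transmitted, and when (in seconds).
struct Emission {
  std::uint64_t mac;
  double time_s;
};

// Total time during which each device counts as active, in seconds.
std::map<std::uint64_t, double> cumulative_activity(
    std::vector<Emission> emissions, double window_s = 180.0) {
  assert(window_s > 0.0);

  // Process transmissions in chronological order.
  std::stable_sort(emissions.begin(), emissions.end(),
                   [](const Emission& a, const Emission& b) {
                     return a.time_s < b.time_s;
                   });

  std::map<std::uint64_t, double> last_seen;
  std::map<std::uint64_t, double> active_s;
  for (const Emission& e : emissions) {
    const auto previous = last_seen.find(e.mac);
    if (previous == last_seen.end()) {
      // First transmission: opens one full activity window.
      active_s[e.mac] = window_s;
    } else {
      // Later transmission: extends activity by the gap since the previous
      // one, capped at one window when the device had gone silent.
      active_s[e.mac] += std::min(e.time_s - previous->second, window_s);
    }
    last_seen[e.mac] = e.time_s;
  }
  return active_s;
}
```

For example, a device that transmits at 0 s, 60 s, and 1000 s is counted as active for 420 seconds: 240 seconds for the first two transmissions, whose windows overlap, plus 180 seconds for the isolated third one.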
These variations show clearly in the mean activity durations: 2 h 36 for the office trace, 11 h 48 for the suburb trace, and 2 h 21 for the city-centre trace (even though this trace is three times longer than the other two). In conclusion, in some environments devices are, on average, active more often.

Footnote 3: A device that appears on several channels is counted several times.

Growth of the number of devices

Figure A.7 shows the growth of the number of devices. Each curve is associated with a given channel and trace (with an additional curve per trace representing the average over channels). Each point shows how many distinct devices a (trace, channel) combination contains between the start of the measurement and the time given on the horizontal axis. We consider that each MAC address represents one device, and we look for MAC addresses in all fields of the frame. That is, some devices are mentioned as a receiver but never as a transmitter, which is why we discover more devices than in Figure A.6. Moreover, because of a detail of IEEE 802.11, some address fields carry values that do not actually correspond to real MAC addresses (but to independent BSSIDs). We ignore these fields.

FIGURE A.7 Number of distinct MAC addresses that each trace contains between the start of the measurement and a given time. Panels: (a) Office; (b) Residential, suburb; (c) Residential, city centre. Curves: channel 1, channel 6, channel 11, and average. Horizontal axis: time in days.

Several observations can be drawn from Figure A.7.

- Devices are not evenly distributed across channels. Curiously, the distribution differs from that of devices that transmit frames. In all traces, channel 1 contains the fewest devices, which is consistent with the previous results (see above). However, channel 6 attracts more users than channel 11 in two of the three traces. This directly contradicts the distribution in Figure A.6. One difference is that Figure A.6 only accounts for transmitters, whereas Figure A.7 considers all devices. In any case, this means that determining how users are distributed over certain channels is not straightforward.
- The discovery rate reveals a day/night phenomenon. The curves alternate periodically between flat periods and growth periods. The amplitude and period of this phenomenon vary with the trace, but it is observed in all traces.
  Flat periods occur at night: they generally begin around midnight and end a few hours before noon. This shows, as one could expect, that Wi-Fi activity is correlated with human activity.
- In the office and city-centre traces, the discovery rate is constant over a long period. In the city-centre trace, this remains true even after one week of measurement. In contrast, the suburb trace flattens after two days. We believe this is a direct consequence of the environment: city centres and offices see higher mobility and a greater turnover of people. One can therefore expect many new users to appear and not return before the end of the measurement. This also explains why mean activity times per user are higher in the suburb trace (see above). Note, however, that even when the discovery rate drops after two days, new users can still be discovered towards the end of the trace.

Among all these observations, we believe two are particularly important. First, as activity durations show, users are mobile, or else they usually switch off their Wi-Fi equipment. This yields packet traces in which most devices are off most of the time. Second, environments affect mobility differently. This translates into new users appearing either evenly over time or clustered at the beginning of the trace.

Among the other results not discussed in this summary, we also noted that the intensity of Wi-Fi activity alternates between residential areas and offices. This is because all environments are part of users' lives, but each at a specific time of day.

A.3 Conclusion

Wireless sniffing is a powerful technique for measuring the activity of Wi-Fi networks, although it raises a number of questions, both pragmatic and theoretical. On the one hand, the software available for handling IEEE 802.11 traces is often unsatisfactory. On the other hand, the relevance of IEEE 802.11 packet traces is open to question.
In this thesis, we address these questions and provide a number of answers. First, we develop WiPal, a software toolbox that eases the processing of IEEE 802.11 packet traces. WiPal includes a flexible trace-merging tool. Next, through the analysis of two short traces, we study the accuracy offered by Wi-Fi monitors. A final study collects and exploits three long traces in different environments, which lets us study how users make use of Wi-Fi.

To extend these analyses, we are currently analysing traces obtained with several monitors spread across Parc Monceau, in Paris. Wi-Fi activity in this park interests us because it includes several access points located in different parts of the park. With ten monitors, we covered an area equivalent to about half of the park (see Figure A.8). Our analyses are ongoing and we have few results so far. The traces include 138 transmitters, 71 of which are Apple devices. We believe these are mainly mobile devices (iPhone or iPod touch). With so many mobile devices, these traces may reveal new usages.

FIGURE A.8 Position of the monitors for trace collection in Parc Monceau. Trace analysis is in progress. (Background: Google Maps.)

Finally, we plan several lines of work to extend WiPal and better understand the phenomena observed above. Support for new protocols and new algorithms can be added to WiPal, to demonstrate its genericity and make it a universal tool. We would also like to make it even easier to use, and to improve its testing procedures (and why not prove it formally?).

Regarding our measurements of monitor accuracy, we would like to run controlled experiments to measure the real impact of congestion on monitors. We should also study why the sniffing process shows so much variability. To this end, it would be interesting to use different kinds of hardware and to vary the experimental parameters.

Regarding the measurement of different environments, we have two main questions. First, we would like to see to what extent our analysis method introduces measurement bias. Second, we would like to test more environments and try to bring out categories of environments with similar properties.
