RES 302: IP Networks (Internet Protocol)

Alberto Blanc, Associate Professor, RSM department
Jean-Pierre Le Narzul, Associate Professor, RSM department
Nicolas Montavont, Associate Professor, RSM department

UV1 MAJ RES: Network Protocols and Architectures
Lecture notes, Fall 2012
Module coordinators: Alberto Blanc, Nicolas Montavont

Disclaimer

These notes are a first draft and are provided as-is. They are meant as a starting point: they cannot substitute for note-taking during lectures, nor do they constitute a textbook. They do not cover all the material that will be covered in class. For a complete treatment, the reader is strongly encouraged to refer to one (or more) of the following textbooks: Kurose and Ross [16], Tanenbaum [38], Bonaventure [4] (freely available online), and Toutain [40] (in French).

For the sake of simplicity and, above all, to make this text (and the class) easily understandable, we will not give all the details, nor mention all the possible cases and exceptions. Readers who are already familiar with a specific topic covered in this class might, therefore, be surprised by some of the statements made here. Once more, the interested reader is encouraged to consult one of the textbooks referenced above.

1 Application and Transport Layers

1.1 Introduction

We will use a top-down approach for RES 302, starting from the application layer of the ISO-OSI reference model and ending with layer 2 (we will not cover the physical layer). Due to time limitations, we will not cover all the details of the different layers and the corresponding protocols. We will, nonetheless, strive to offer a fairly complete and coherent view of the main issues present at each layer and of how they have been addressed in the Internet.

Note that, while the seven ISO-OSI layers are often used as a reference model, the TCP/IP protocol suite uses only five layers: physical, data link, network, transport and application. The first four layers have almost the same role as in the ISO-OSI model, while the application layer corresponds to the application, presentation and session layers of the ISO-OSI model. Note as well that the definitions and functions assigned to each layer in the ISO-OSI model are useful mainly to obtain a well-organized and coherent view of the different elements of a computer network; they should not be interpreted as inviolable rules.

As a first motivating example, we will consider Internet browsing. Nowadays we are all familiar with web-browsers and the so-called World Wide Web. We will start by describing what happens when a user types a URL (Uniform Resource Locator) in the address bar of a web-browser and presses enter. In this first chapter we will focus on the application and transport layers.

1.2 Behind the Scenes of a Web-Browser-Web-Server Connection

Most of us use a web-browser (Firefox, Internet Explorer, Chrome, etc.) several times a day. We are used to typing an address into the address bar of the browser and expecting it to show the corresponding web-page. This seemingly simple operation requires, in almost all cases, quite a few exchanges between the end-node, the DNS server, and the web-server itself, just to name the main elements. Suppose we type http://rammus.labo4g.enstb.fr:80/index.html in the address bar.
In so doing we are requesting the following:

1. use the Hyper Text Transfer Protocol (HTTP);
2. connect to the machine whose name is rammus.labo4g.enstb.fr on port 80;
3. retrieve the index.html object.

The string http://rammus.labo4g.enstb.fr:80/index.html is often called a Uniform Resource Locator (URL) as it (uniquely) identifies a specific resource in the Internet. URLs can have more complicated forms (e.g., including usernames or addresses), but these are outside the scope of these notes.

It is sometimes the case that slightly different URLs correspond to the same object. For example, if the http:// part is missing, the web-browser will use HTTP by default. Similarly, if the port number :80 is missing, the web-browser will try to connect to port number 80 by default, as that is the port number reserved for HTTP. Finally, if the URL does not name a specific object at its end (index.html in the example above), the web-server will return the object which has been defined as the default. In this case the default object is indeed index.html, so that http://rammus.labo4g.enstb.fr is equivalent to http://rammus.labo4g.enstb.fr/index.html. This is why we can use URLs like telecom-bretagne.eu, metropole.rennes.fr, meteofrance.fr, where we specify only the host name and nothing else.

1.2.1 A Very Brief Overview of HTTP

HTTP is a text-based application-layer protocol which is typically used by web-browsers to communicate with web-servers. It can be used as a simple request-response protocol, in which case the client (i.e., the web-browser, also called user agent) sends requests to the server, which replies to each request either with a positive response or, in the case of an error (e.g., the requested resource does not exist), with a negative one. HTTP uses a TCP connection (see section 1.3) to send requests and receive responses.

Each request consists of one request line followed by zero or more headers. The request line and each header end with two special ASCII characters: <CR> (carriage return) and <LF> (line feed). In other words, HTTP uses a new line to separate the different elements of the request.
The headers themselves obviously cannot contain the <CR><LF> sequence, as this would be interpreted as the end of the header. The request itself ends with an empty line, that is, a line containing only <CR><LF> and no other characters (not even spaces). This implies that the server will wait until it receives the empty line before responding. HTTP version 1.0 [3] allows requests containing only the request line and no headers, as can be seen in the following exchange, obtained using the telnet program.

Using telnet (HTTP 1.0)

    1 user@machinename:~$ telnet rammus.labo4g.enstb.fr 80
    2 Trying ...
    3 Connected to rammus.labo4g.enstb.fr.
    4 Escape character is '^]'.
    5 GET /
    6 <html><body><h1>It works!</h1></body></html>
    7 Connection closed by foreign host.

The command telnet hostname portnumber can be used to open a TCP connection with the host whose name is hostname, using the port number portnumber. The telnet program simply displays all the characters it receives from the server and sends all the characters typed by the user to the server. It can therefore be used to emulate a client of a text-based protocol like HTTP, given that in this case all the information exchanged between the client and the server consists of printable characters.

Note that only the command on line 1 and the request on line 5 were typed by the user; the rest is the output of telnet. Lines 2, 3, and 4 explain what telnet itself is doing; line 6 is the message received from the server (i.e., the HTML document) and line 7 lets us know that the server closed the connection. This is because with HTTP/1.0 a TCP connection is used for only one request/response.

We can obtain the same response from the server if we specify the name of the file which corresponds to the default object:

Using telnet specifying the requested object (HTTP 1.0)

    user@machinename:~$ telnet rammus.labo4g.enstb.fr 80
    Trying ...
    Connected to rammus.labo4g.enstb.fr.
    Escape character is '^]'.
    GET /index.html
    <html><body><h1>It works!</h1></body></html>
    Connection closed by foreign host.

This is because the web-server running on rammus.labo4g.enstb.fr has been configured to use the file index.html as the default object, which is returned when the client does not specify one. We can use HTTP version 1.1 [11] by specifying the protocol version in the request:

Using telnet specifying the requested object (HTTP 1.1)

    user@machinename:~$ telnet rammus.labo4g.enstb.fr 80
    Trying ...
    Connected to rammus.labo4g.enstb.fr.
    Escape character is '^]'.
    GET /index.html HTTP/1.1
    Host: rammus.labo4g.enstb.fr

    HTTP/1.1 200 OK
    Date: Wed, 25 Jul ... :46:53 GMT
    Server: Apache/... (Ubuntu) PHP/...
    Last-Modified: Thu, 02 Feb ... :08:20 GMT
    ETag: "2a815a-2d-4b7fe3e33c900"
    Accept-Ranges: bytes
    Content-Length: ...
    Vary: Accept-Encoding
    Content-Type: text/html

    <html><body><h1>It works!</h1></body></html>

In this case we must use the Host header, specifying the name of the machine we have used in the URL. For reasons which will shortly be explained, this is the only way for the web-server to know the first part of the URL typed by the user. In the case of HTTP/1.1 the server also sends a status line (200 OK) and several headers (Date, Server, etc.) before sending the actual object. Each header is on a single line and starts with the name of the header, followed by a colon (:) and then the value of the header. Finally, a blank line separates the last header from the HTML file, just like a blank line signals the end of the request (after the Host header).

One of the headers (Content-Length) gives the size of the object (the HTML file in this case). This way the client knows when it has received the whole message. In HTTP/1.1 this is needed because the same TCP connection can be used to exchange multiple requests/responses, while in HTTP/1.0 the TCP connection was used for a single exchange. As the typical web-page contains several objects (like images and other embedded content), it is more efficient to use a single TCP connection for multiple objects. This way it is possible to avoid the extra delay caused by the connection establishment (see section 1.3 below).
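The message format described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names build_request and parse_response are made up for this example, the response is assumed to be already fully received, and only the Content-Length header is interpreted. The sample response is hand-written, not captured from a real server.

```python
CRLF = "\r\n"


def build_request(host, path="/index.html"):
    """Return the bytes of a minimal HTTP/1.1 GET request."""
    lines = [
        f"GET {path} HTTP/1.1",  # request line
        f"Host: {host}",         # the Host header is mandatory in HTTP/1.1
        "",                      # empty line: end of the request
    ]
    return (CRLF.join(lines) + CRLF).encode("ascii")


def parse_response(raw):
    """Split a fully received response into (status_line, headers, body)."""
    head, _, rest = raw.partition(b"\r\n\r\n")  # a blank line ends the headers
    status_line, *header_lines = head.decode("ascii").split(CRLF)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    length = int(headers["content-length"])  # size of the object in bytes
    return status_line, headers, rest[:length]


# Hand-written sample: one response followed by the start of another one,
# as could happen when a TCP connection carries several responses.
body = b"<html><body><h1>It works!</h1></body></html>\n"
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/html\r\n"
    + f"Content-Length: {len(body)}\r\n".encode("ascii")
    + b"\r\n"
    + body
    + b"HTTP/1.1 200 OK\r\n"
)

if __name__ == "__main__":
    print(build_request("rammus.labo4g.enstb.fr"))
    print(parse_response(sample))
```

Because the body is cut at Content-Length bytes, the bytes of the following response are not mistaken for part of the first object.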
As previously mentioned, HTTP is a text-based protocol in the sense that all the messages in the protocol are represented as strings of printable^1 ASCII characters, with new lines delimiting each field and with empty lines delimiting messages and separating the headers from the content of the message itself. Therefore, after the TCP connection has been established, a web-browser uses this connection to send the bytes corresponding to the ASCII codes of the characters of the request message. Similarly, a web-server waits for incoming bytes and interprets them as the ASCII codes of the characters representing the request.

^1 The ASCII codes between 0x20 and 0x7E, corresponding (roughly) to the characters normally used in English texts.

1.2.2 The Client-Server Model

HTTP uses the client-server model which, as the name implies, assigns different roles to the two end-nodes involved in the exchange. The server offers one (or more) service(s) and must be ready to accept incoming connection requests from the client(s). This implies that the server process (see section 1.3) must be up and running before the client can access the corresponding service.^2

Often the server is a machine which is exclusively dedicated to offering one or more services. Depending on the circumstances and the expected server load, several services can be offered by the same physical machine. For example, a web-server can be hosted on the same machine as a file server (offering a shared disk and/or other file-related services like FTP). If the expected load is heavy, it is often the case that a single machine will be dedicated to a single service.

As servers are often dedicated, and fairly powerful, machines, the term server is sometimes used to indicate such a machine, as opposed to a desktop or a laptop computer. These servers are typically located with other servers in a room and are not meant to be used directly (with a keyboard and a monitor). In a networking context, though, the term server can simply refer to the machine where there is a running process ready to accept incoming connections. The term can also refer to the process itself (e.g., the web-server process). In this case a certain machine can be a server and a client at the same time.

The term client is used mainly in a networking context and indicates the machine or the process using the service offered by another machine. Note that it is possible to have the server and client processes running on the same machine. This is often done during the development phase of a project but can also be used in a production environment.
1.3 An Overview of TCP

As mentioned in the previous section, a web-browser and a web-server communicate using a TCP connection. The Transmission Control Protocol (TCP) is a transport-layer protocol widely used in the Internet. Measurement studies have consistently shown that it carries the overwhelming majority of Internet traffic: between 83% and 91% of the packets and between 91% and 97% of the data [18, 14, 13]. It is therefore one of the key elements of the Internet.

TCP is typically implemented in the Operating System (OS) kernel (see Figure 1.1). User space programs can access the services offered by TCP using the socket interface.^3

^2 While this might seem trivial, you will see during the programming exercises that it is essential to start the server process before the client one.
^3 You will use this interface in the programming exercises and, indirectly, in the project.
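As a rough illustration of the socket interface and of the rule that the server must be ready before the client connects, the following sketch runs both roles on the same machine over the loopback address. It is not the code of the programming exercises: the function names are invented for this example, the server runs in a thread rather than in a separate process to keep the example self-contained, and the "service" (returning the text in upper case) is arbitrary.

```python
import socket
import threading


def serve_once(listening_socket):
    """Accept one connection and send back the received bytes in upper case."""
    connection, _client_address = listening_socket.accept()  # blocks until a client connects
    with connection:
        data = connection.recv(1024)
        connection.sendall(data.upper())


def run_demo():
    # Server side: the socket is created, bound and listening BEFORE the
    # client tries to connect.
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    worker = threading.Thread(target=serve_once, args=(server,))
    worker.start()

    # Client side: connect to the (IP address, port) pair of the server.
    with socket.create_connection(("127.0.0.1", port)) as client:
        client.sendall(b"hello")
        reply = client.recv(1024)  # a short reply is assumed to fit in one call

    worker.join()
    server.close()
    return reply


if __name__ == "__main__":
    print(run_demo())
```

If the client were started before the listen call, its connection attempt would be refused, which is the situation the notes warn about.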

[Figure 1.1: Typical networking stack implementation (application layer; TCP, UDP, ...; IP; link layer; physical layer; user space and kernel space).]

1.3.1 What TCP Offers to the Applications

TCP offers a reliable service carrying an ordered byte stream between two processes (in both directions). A process is an instance of a computer program running on a specific system (see Section 1.3.3 below for more details). TCP uses acknowledgments to ensure that the data sent by the sender have been correctly received by the destination. These acknowledgments allow TCP to use an unreliable network layer (IP), which can lose, duplicate and re-order packets. TCP numbers each byte sent by the source so that it can ensure that each byte will be delivered in the right order and exactly once.

As TCP offers a byte-stream service, the sender cannot make any assumptions about message boundaries: if, for example, the sending application sends two messages (by calling the send system call twice), the receiving application might receive both messages in a single receive call, or in multiple ones.^4 In other words, TCP does not guarantee that message boundaries will be respected, where by message we mean whatever is sent with a single send call. This is why HTTP uses the <CR><LF> sequence to separate one header from the following one: even if the receiving application receives one byte at a time, it will be able to detect these two bytes one after the other and conclude that it has completely received one header.

TCP is a connection-oriented protocol in the sense that the two end-nodes need to exchange a few signaling messages (notably to synchronize the sequence numbers in both directions) before they can start exchanging data. Similarly, when they wish to terminate the connection they are supposed to send signaling messages indicating that they want to close the connection.
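The absence of message boundaries discussed above can be illustrated without any network: in the sketch below each element of chunks stands for whatever a single receive call happened to return. The function name split_lines is invented for this example, and the code is a minimal sketch of delimiter-based framing, not a complete HTTP parser.

```python
def split_lines(chunks):
    """Yield complete <CR><LF>-terminated lines from an iterable of byte chunks."""
    buffer = b""
    for chunk in chunks:  # each chunk models the result of one receive call
        buffer += chunk
        while b"\r\n" in buffer:
            line, buffer = buffer.split(b"\r\n", 1)
            yield line


message = b"GET /index.html HTTP/1.1\r\nHost: rammus.labo4g.enstb.fr\r\n\r\n"

# The same bytes delivered in one piece, or one byte at a time.
in_one_piece = list(split_lines([message]))
byte_by_byte = list(split_lines(message[i:i + 1] for i in range(len(message))))

if __name__ == "__main__":
    print(in_one_piece)
    print(in_one_piece == byte_by_byte)
```

However the bytes are cut into chunks, the receiver recovers the same lines; the final empty line marks the end of the request.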
As a TCP connection is a full-duplex connection, carrying data in both directions, one can think of each connection as being made up of two half-connections, one in each direction. As these two half-connections are independent, one can be closed before the other.

As shown in Figure 1.2, only the end-nodes are involved in a TCP connection. The intermediate systems (like the IP routers shown in the figure) do not touch the TCP header and the data it carries. Routers implementing Network Address Translation (NAT, see the chapter about IP) are one notable exception, as they can modify the TCP header as well as the TCP payload. There are other exceptions like, for example, HTTP proxies, but these are outside the scope of RES 302.

^4 The send and recv system calls can be replaced by the write and read system calls respectively, as these are basically equivalent and differ only in the presence of more flags in the send and recv calls.

[Figure 1.2: HTTP and the ISO-OSI layers. A client end node (web-browser, TCP, IP, Ethernet) and a server end node (web-server, TCP, IP, Ethernet) are connected through Router 1 to Router N (IP over Ethernet and PPP links); the HTTP and TCP protocols are shown between the two end nodes only.]

It is important to stress that, as suggested in Figure 1.2, the TCP layer in each end-node is not aware of the intermediate nodes traversed by the packets used to carry the TCP messages. As far as TCP is concerned, there is no difference whatsoever whether the two end-nodes are directly connected (for example by an Ethernet segment) or whether they are on different continents, with many routers between them. Due to space constraints, Figure 1.2 shows only two routers but, as suggested above, the number of routers between the two end-nodes can vary (with 0 being a possible case).

IP routers are often represented as in Figure 1.2, with only layers 1, 2, and 3 (in this case layers 1 and 2 are represented by the same box), to emphasize that routers operate at layer 3 and do not consider the upper-layer headers. (As layer 3 uses the services of layer 2, which, in turn, uses the services of layer 1, the presence of layer 3 implies the presence of layers 1 and 2 as well.) In reality IP routers do have an implementation of the higher layers, including TCP, which is used to handle connections terminating on the router itself. Such connections are typically used for configuration, management and monitoring purposes and are not directly involved in the main activity of the routers, namely forwarding packets.

1.3.2 Transport Layer Ports

As mentioned in the previous section, TCP enables processes (typically running on different computers) to communicate with each other. Given that the same two processes might wish to use more than one connection between them, TCP needs a mechanism to distinguish these connections.
Furthermore, it is often the case that multiple processes are running on the same machine and that each one of them can open one or more TCP connections. In this case as well, TCP needs a mechanism to distinguish different communication end-points on the same machine.

[Figure 1.3: Multiple TCP connections and the corresponding ports. End-node 1 and end-node 3 run a web-browser, end-node 2 runs the web-server; each end-node is labelled with its IP address and the port numbers in use.]

One possible solution would be to use a network-wide unique address for each transport-layer end-point, independent of the network-layer address. TCP, instead, uses the network-layer (IP) address as a part of the transport-layer address. On each end-node, each communication end-point is identified by a locally unique port number, which is an integer in the range 0 to 65535 (so that it can be represented using two bytes). To make sure that each end-point has a unique network-wide address, TCP combines IP addresses and port numbers: each TCP end-point is uniquely identified by the IP address of the host computer and a port number. As IP addresses are unique, this is enough to guarantee that transport-layer addresses are unique as well. One advantage of this solution is that the local Operating System (OS) can choose the port number without worrying about the port numbers used on other machines. The downside is that transport-layer addresses use network-layer addresses, which is a violation of the layering principle.

A TCP connection is uniquely identified by four elements:

1. the source IP address;
2. the source port;
3. the destination IP address;
4. the destination port.

As a TCP connection can carry data in both directions, which end-point is called source and which destination is immaterial: the roles can be switched and the TCP connection is still the same one. Figure 1.3 shows an example with three connections. All of them use port 80 on the machine running the web-server; this is not a problem, as the IP address and/or the port number is different at the other end-point. Note that both end-node 1 and end-node 3 use the same port number (37650) but, as the IP addresses of the two nodes are different, it is possible to distinguish between the two connections.

Whenever a process wishes to establish a TCP connection with another process, it needs to know the IP address and the port number corresponding to the remote process. One option is for the user to specify these values explicitly. For example, in the previous section we could have given telnet the IP address of the machine whose name is rammus.labo4g.enstb.fr instead of the name itself. As people can usually remember names more easily than numbers, the Domain Name System (DNS, see section 1.4) can convert a name into the corresponding IP address. Most programs, including web-browsers, can handle both IP addresses and names, and can automatically distinguish between the two. If we want, we can also type the IP address in the address bar of a web-browser, instead of rammus.labo4g.enstb.fr. If we type a name instead, the program will use the DNS to obtain the corresponding IP address.^5

As previously mentioned, if the user does not specify the port number when using a web-browser, the web-browser will try connecting to port number 80 by default. This is an example of a well-known port, that is, one of a series of ports which are typically associated with a given service. Other examples are: port 23 for telnet (remote terminal), port 22 for ssh (secure remote terminal), port 25 for the Simple Mail Transfer Protocol (SMTP, used for e-mail). All the well-known ports are in the range 0 to 1023. The Internet Assigned Numbers Authority (IANA) is responsible for assigning and maintaining the list of well-known ports, as described in RFC 6335 [7]. Well-known ports are also called system ports. The ports between 1024 and 49151 are called registered or user ports.
IANA can also assign these ports, but it uses a simpler and less stringent procedure than the one used for system ports. Finally, the ports between 49152 and 65535 are called dynamic or ephemeral ports and are typically assigned dynamically by the OS when a new TCP connection is established.

To summarize, whenever a process wishes to open a TCP connection with another process, it must specify the IP address and the port number of the remote process. For example, suppose that the remote process is on port 35. The connection can be established only if the remote process is already listening on port 35, that is, if it has previously notified the operating system that it wants to receive (and handle) all the connection requests addressed to that port. To achieve this goal the process must use a few different system calls: first the socket system call to create a new socket, then the bind system call to specify the port number it wants to listen on, and finally the listen system call to start listening for connection requests (the accept system call then blocks until a new connection request arrives). Clearly, only one process at a time can be listening on a given port. In our example, if there is already a process listening on port 35, the OS will return an error whenever a different process tries to bind to that port. On most systems (Unix, Linux, Windows, Mac), only processes running with administrative privileges (i.e., root in Unix/Linux/Mac) can bind to system ports.

^5 In the case of Python programs, the socket library can directly handle both IP addresses and names. Therefore, in your programs you will be able to use IP addresses and names interchangeably.

1.3.3 Computer Processes

Computer programs and OS processes are related but different concepts.^7 A computer program can roughly be described as a series of instructions operating on a set of variables. These days the instructions are typically written in a high-level language like Java, C/C++, Python, etc., and they are converted by a compiler and/or an interpreter to machine code, which can be directly executed by a certain processor. Consider the following trivial Python script:

Sample Python Script: add.py

    #!/usr/bin/env python3
    # add.py: repeatedly read two integers and print their sum.

    print("This is a simple program which adds two numbers")
    message = "Type two numbers (put a space between them and press enter)\n"
    while True:  # the condition is always true: loop forever
        inputstring = input(message)
        fields = inputstring.split()  # split the line on whitespace
        a = int(fields[0])
        b = int(fields[1])
        print("the answer is:", a + b)

which just prompts the user for two numbers and then prints their sum. This operation is repeated forever, as the looping condition is always true. This program contains a few variables, including a and b, which always contain the last two numbers given by the user.

Whenever this program is executed, for example by typing python3 add.py in a shell (sometimes called terminal), the OS will execute the python3 command by loading its executable code into memory and executing the first instruction. Eventually the interpreter will read the add.py file and execute each instruction in it, storing the corresponding variables in memory. When the OS executes the python3 command it creates a new process, that is, a new instance of a running program.
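The difference between a program and a process can be observed from Python itself. In the sketch below, which is only an illustration, the same one-line program is started as a new process and reports its own process ID (process IDs are introduced more precisely later in this section); the names CHILD_CODE and run_instance are invented for this example.

```python
import os
import subprocess
import sys

# A tiny program: it prints the ID of the process that is executing it.
CHILD_CODE = "import os; print(os.getpid())"


def run_instance():
    """Start one instance of the tiny program and return the ID it printed."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],  # same interpreter as the caller
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


if __name__ == "__main__":
    print("this process:", os.getpid())
    print("new instance:", run_instance())
```

Each call creates a new process to run the program, so the ID it reports differs from the ID of the process that started it.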
Roughly speaking, a process is always associated with at least three things: the executable code (the machine code instructions), the data section (where the variables are stored), and the instruction pointer, which points to the current instruction and is updated each time a new instruction is executed. You can think of the instruction pointer as roughly equivalent to the highlighted line shown by most debuggers when you execute a program step-by-step.

Most modern OSs can execute multiple programs at the same time, even if the computer has a single CPU. This is achieved by time multiplexing: time is divided in slots (typically a small fraction of a second), and in each slot the OS schedules a different process by instructing the CPU to execute the instruction specified by the instruction pointer of the process in question. During each slot the CPU executes the instructions of a single process. At the end of each slot an interrupt (controlled by a timer) gives control back to the OS, which selects the process that will run next.

^7 Threads are yet another related concept, which is outside the scope of RES 302; just be aware that threads and processes are not the same thing.

Clearly, it is possible to execute a program more than once, and it is possible to start a new instance before the previous one has terminated. For example, by using two different shells, it is possible to have two instances of the add.py script shown above. Each instance will have its own variable a, with potentially different values (the same is obviously true for the variable b as well). This is because the two instances correspond to two different processes, with two separate memory areas for storing variables. Each process has a unique identifier (typically called process ID) so that it can be distinguished from all the others.

It is possible to use the fork system call to create a new process by cloning an existing process. The only difference between the two cloned processes is the return value of the fork call: 0 for the newly created process (often called the child process) and a positive number, corresponding to the process identifier of the child process, for the original process (often called, unsurprisingly, the parent process). In the case of an error the cloning does not take place and the parent process receives a negative return value from fork. It is useful to mention that, while the parent and child processes do not share any memory (so that variables can take different values in each process), they do share open files (including sockets, as these are just a special case of a file).^8

These are just the main points related to processes.
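The behaviour of fork described above can be tried with the following sketch, which only runs on Unix-like systems. It is an illustration rather than the code of the programming exercises: the function name fork_demo is invented, and a pipe is used only so that the child can report its value of a to the parent. Note that in Python a failed fork raises an OSError exception instead of returning a negative value.

```python
import os


def fork_demo():
    a = 1
    read_end, write_end = os.pipe()  # open files are shared across fork
    pid = os.fork()
    if pid == 0:
        # Child process: fork returned 0.
        os.close(read_end)
        a = 100  # changes only the child's copy of the variable
        os.write(write_end, str(a).encode("ascii"))
        os._exit(0)  # terminate the child here, without running the parent's code
    # Parent process: fork returned the (positive) process ID of the child.
    os.close(write_end)
    child_value = int(os.read(read_end, 16))
    os.close(read_end)
    os.waitpid(pid, 0)  # wait for the child to terminate
    return pid, a, child_value


if __name__ == "__main__":
    child_pid, parent_a, child_a = fork_demo()
    print("child process ID:", child_pid)
    print("a in the parent:", parent_a, "- a in the child:", child_a)
```

The parent still sees a equal to 1 although the child set its own copy to 100, while the pipe, being an open file, is usable from both processes.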
This section is meant as a very general and high-level introduction to what processes are, as they are fundamental actors in computer networks (being the origin and destination of end-to-end transport-layer connections). As we do not need to cover all the details in order to understand their role in computer networks, we will not dwell any further on this topic. The interested reader is referred to any standard OS textbook (e.g., Abraham et al. [37], Tanenbaum [39]), where one can find the many aspects that were not covered in this short introduction.

1.4 The Domain Name System (DNS)

As we will see in the chapter about IP, in an IP network each interface has a unique IP address, so that it can be distinguished from all the other interfaces in the network. As mentioned in section 1.3.1, TCP uses IP addresses in order to specify the end-points of a TCP connection, and the same is true for the User Datagram Protocol (UDP), which is another transport-layer protocol. As already observed in section 1.3.2, it is easier for users to remember names rather than IP addresses (which are sequences of numbers). The Domain Name System (DNS) addresses this problem by mapping names to IP addresses. DNS can also answer so-called reverse lookup queries by returning the name(s) associated with a given IP address.

^8 You should remember this when you use the fork system call in the first set of programming exercises.
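Reverse lookups rely on ordinary names placed in the in-addr.arpa subtree, where the bytes of the IP address appear in reverse order. The sketch below builds such a name for an IPv4 address; the function name reverse_lookup_name is invented for this example, and the address 192.0.2.1 is taken from a range reserved for documentation, not from these notes.

```python
import ipaddress


def reverse_lookup_name(address):
    """Return the name to resolve in order to find the name(s) of an IPv4 address."""
    octets = address.split(".")  # "192.0.2.1" -> ["192", "0", "2", "1"]
    return ".".join(reversed(octets)) + ".in-addr.arpa"


if __name__ == "__main__":
    print(reverse_lookup_name("192.0.2.1"))
    # The standard library offers the same conversion.
    print(ipaddress.ip_address("192.0.2.1").reverse_pointer)
```

Reversing the bytes puts the most significant part of the address closest to the root, in the same way as the top level domain is the rightmost part of a name.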

While IP and TCP go back to the very beginning of the Internet (in the 1970s), the DNS specifications in use today were not published until 1987 ([19, 20]). One of the first solutions to the name-to-IP-address mapping problem was actually based on storing the mapping information in a file on each host. Traces of this solution are still in use today, as several operating systems use the hosts file, where one can specify mappings between names and IP addresses. This file is examined before any DNS query is made, so that its mappings have the highest priority and can override the normal mappings stored in the DNS. Clearly, a solution based on every machine having a configuration file with all the mapping information is not scalable, not to mention that it is far from easy to manage and to update.

The modern DNS can be described as a distributed database, where by distributed we mean that there is no single node with all the information; rather, each node contains only a subset of the information and knows where to look for the missing parts. Given the scale of the Internet, it is not feasible to have a flat naming space, that is, a naming scheme where the names have no hierarchy and are all on the same level. In such a scheme it would be very hard to add new names, as one would need to find a name which has not yet been used. It would also be really hard to distribute the mapping information between several nodes, as there would be no obvious way of determining which node is responsible for each specific name. In contrast, in a hierarchical solution, especially in a tree-based one, it is very easy to distribute the mapping information among several nodes and it is also easy to determine which servers are responsible for storing that information (this greatly simplifies the query part).^9 The name space used in the Internet is indeed based on a tree structure, with each level in the tree corresponding to a different (sub)domain.
Dots separate each level in the name, with the top level domain (TLD) being the rightmost one, and with all the subdomains to its left. For example in the name the top level domain is eu which, being a TLD, is a child of the root of the naming space; telecom-bretagne is a subdomain and www is a leaf in the tree as it is the leftmost part of the name. Normally leaves correspond to individual computers. The height of the tree is not limited to three as in this example, for example the name rammus.labo4g.enstb.fr has two subdomains. Figure 1.4 shows a (very tiny) subset of the Internet name space, where the root of the tree is represented by a dot. The in-add.arp subtree represents the inverse mapping between IP addresses and names. For example to obtain the name corresponding to the , one would ask the DNS to resolve the name in-addr.arp, where the order of each byte of the IP address is reversed. The fully qualified name of a leaf is the name which contains the leaf itself and all the subdomains between the leaf and the root of the tree. 10 Each fully qualified name has to be unique across the Internet. Thanks to the tree structure, the uniqueness of each 9 To increase reliability, several servers contain the same mapping information, according to RFC 1034 [19] (section 4.1) each mapping has to be stored on at least two servers. 10 absolute name and complete name are equivalent terms to fully qualified name. 14

fully qualified name is guaranteed if no two children of a node have the same name. For example, in Figure 1.4 there are two nodes with the name www but, as they have different parents (cnn and telecom-bretagne), they correspond to two different fully qualified names.

[Figure 1.4: The graphical representation of a subset of the Internet name space.]

Thanks to the tree structure it is possible to divide the mapping information among several servers in such a way that the name itself indicates which servers to contact. In theory it is enough for each node in the tree to know the IP address of the server responsible for each child. Suppose we would like to find the IP address corresponding to a name in the telecom-bretagne subdomain of the .eu domain. First we need to find the IP address of the server responsible for the .eu domain. This is done by asking one of the root servers, whose IP addresses are well known. Once we know the IP address of the server responsible for the .eu domain, we can contact it and ask for the IP address of the server responsible for the telecom-bretagne subdomain. With this information we can finally contact the server responsible for the telecom-bretagne subdomain to obtain the IP address corresponding to the name.

In theory, whenever a host¹¹ needs to find the IP address corresponding to a name, it could use a similar procedure and query all the servers itself. In reality, each IP host knows the IP address of one or more local DNS servers.¹² By local we mean that the network distance, in terms of hops, between the host and the DNS server is small. Most Internet Service Providers (ISPs), quite a few institutions (universities, research centers) and most large companies have their own DNS servers.
¹¹ End-nodes in the Internet are often called hosts.
¹² The IP addresses of the local DNS servers are either configured by hand or obtained using DHCP (see the chapter about IP) or a similar protocol.

One important advantage of such a solution is that each server can cache the responses it receives so that, if several

hosts ask for the same name, it will be able to respond to the query without contacting all the servers responsible for each (sub)domain. Note that this is useful even when different queries share just part of the name. For example, several users at Telecom Bretagne will probably ask to resolve names ending in .fr. (To resolve a name means to find the IP address corresponding to a given name.) In this case the DNS server at Telecom Bretagne can store in its cache the IP address of the DNS server responsible for the .fr domain, so that it can contact it directly without first contacting one of the root servers. Each time a DNS server sends a reply (either to an end-node or to another server), it indicates for how long the response should be cached (the Time To Live, which should not be confused with the Time To Live field in the IP header).

[Figure 1.5: Different DNS query types. (a) A recursive query. (b) A mixed query. The numbers represent the order in which messages are exchanged.]

DNS is also the name of the protocol used to query DNS servers. Typically this protocol uses UDP but it can also use TCP. For the purposes of RES 302 it suffices to know that DNS is a request-response protocol, with a common header for requests and responses. Until 2010, it supported names using exclusively the Latin alphabet (excluding letters with diacritical marks like accents). Since 2010, the DNS does support other alphabets (RFC 5890 [15]).

DNS supports two different query types: iterative and recursive. In a recursive query each server asks the following one to resolve the name, as shown in Figure 1.5(a). Suppose that the client asks the local server to resolve the name rammus.labo4g.enstb.fr. The client contacts the local server, which already has in its cache the IP address of the server responsible for the .fr domain, which is $s_1$.
The local server then contacts $s_1$. Instead of telling the local server which server to contact next, $s_1$ contacts the next server ($s_2$) directly; in this case $s_2$ is the server responsible for the enstb.fr domain. Similarly, $s_2$ contacts $s_3$ (the server responsible for labo4g.enstb.fr). The response then follows the same path in the reverse direction. One of the problems with recursive queries is that each server is forced to store a certain amount of state information for each pending query so that it can match the

responses to the requests and forward them correctly. Another problem is that a server accepting recursive queries must process more packets (compare Figures 1.5(a) and 1.5(b)). According to RFC 1035 [20], support for recursive queries is optional. One of the flags in the header of each DNS packet indicates whether the request is recursive or not. Each of the servers in Figure 1.5 can decide (at least in theory) whether to use a recursive or an iterative query. In an iterative query, if the server does not know the IP address corresponding to the name, it replies with the IP address of the next server to contact. Iterative and recursive queries can coexist in a DNS request, as shown in Figure 1.5(b), which is actually the typical case. The request from the client to the local server is recursive, while the local server uses iterative queries to resolve the name. In this case the client asked for the IP address of a host in the lemonde.fr domain. Because of the problems mentioned above, the typical (and properly configured) DNS server accepts recursive queries only from a limited and well identified set of IP addresses (for example, in the case of an ISP, the IP addresses of its customers).

So far we have talked about DNS servers and subdomains without detailing the relationship between them. This is because the relationship is a matter decided by the administrator of each subdomain. There are only two requirements: one is to have a server (actually at least two, for redundancy reasons) in charge of each non-leaf node of the tree corresponding to the Internet name space; the other is for each server corresponding to a non-leaf node to know the IP addresses of all the servers responsible for its children. It is often the case that the same server is in charge of multiple subdomains. For example, the DNS servers at Telecom Bretagne are responsible for the subdomains telecom-bretagne.eu, enstb.fr and enst-bretagne.fr. The DNS servers in charge of the .eu and .fr domains know the IP addresses of these servers.
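The combination of iterative queries and caching performed by a local server can be sketched as follows. This is only a toy model under stated assumptions: the server names, the `ZONES` table and the address placeholders are hypothetical, no network traffic is involved, and the Time To Live of cached entries is ignored.

```python
# Hypothetical delegation data: for each server, the children it knows about.
# A child maps either to the server in charge of it or, for a leaf, to a
# placeholder standing for the final IP address.
ZONES = {
    "root-server":   {"fr": "fr-server"},
    "fr-server":     {"enstb": "enstb-server"},
    "enstb-server":  {"labo4g": "labo4g-server"},
    "labo4g-server": {"rammus": "ADDRESS-OF-RAMMUS",
                      "forge": "ADDRESS-OF-FORGE"},
}


def resolve(name, cache):
    """Resolve `name` with iterative queries, as a local server would.

    `cache` maps a domain (e.g. "enstb.fr") to the server in charge of it.
    Returns the answer and the list of servers that were contacted.
    """
    labels = name.rstrip(".").split(".")[::-1]  # TLD first: fr, enstb, ...
    server, depth = "root-server", 0
    # Start from the longest domain whose server is already in the cache.
    for i in range(len(labels) - 1, 0, -1):
        domain = ".".join(reversed(labels[:i]))
        if domain in cache:
            server, depth = cache[domain], i
            break
    contacted = []
    while True:
        contacted.append(server)
        reply = ZONES[server][labels[depth]]
        depth += 1
        if depth == len(labels):
            return reply, contacted  # final answer
        # Otherwise the reply names the next server: remember it and go on.
        cache[".".join(reversed(labels[:depth]))] = reply
        server = reply


cache = {}
print(resolve("rammus.labo4g.enstb.fr", cache))  # walks down from the root
print(resolve("forge.labo4g.enstb.fr", cache))   # reuses the cached servers
```

With an empty cache the first call contacts four servers, starting from the root; the second call contacts only the server in charge of labo4g.enstb.fr, because the earlier replies were cached.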
Finally, it is worth mentioning that the e-mail system uses the DNS as well. Whenever an e-mail server has a message to send, for example to [email protected], it asks its local DNS server for the IP address of the mail server responsible for the telecom-bretagne.eu domain. The querying mechanism is the same as the one described above, the only difference being that the DNS request specifies that it is about an e-mail domain and not a hostname.

2 Addressing in General

2.1 Introduction

The goal of this chapter is to present in a general way the problems associated with addressing in computer networks and the interactions between addressing and routing. These problems present themselves at different layers, notably at layers 2 and 3. By definition, there are multiple nodes in a computer network and one needs a way to distinguish each one. One function of an address, if not the main one, is to uniquely identify a node or, in the case of TCP/IP and IEEE 802 networks, a specific interface. We have already seen in the previous chapter how TCP uses a combination of IP addresses and port numbers to uniquely identify communication end-points. Without such unambiguous addresses it would not be possible to establish a communication with the desired end-node.

Also in the previous chapter, we have seen how names can be used to identify an end-node instead of its IP address and how DNS provides the mapping between names and IP addresses. One could say that names are the addresses of the application layer. It is indeed the case that each layer of the ISO-OSI model uses its own addresses, which are independent of the addresses used at the other layers. The notable exception is the transport layer in the TCP/IP protocol suite, as it uses a combination of layer 3 addresses and port numbers. The consequence of this violation is that TCP is intimately linked to IP and cannot be used with other layer 3 protocols without important modifications. This limitation turned out to be irrelevant in practice, as IP is without any doubt the most widely used layer 3 protocol today. Yet it is a limitation. Compare this with the total independence between IP and layer 2 protocols: IP can be used with many different layer 2 protocols.
Of course one always needs, at the very least, a way to find the layer 2 address corresponding to an IP address, but this is handled by specific protocols (for example ARP in the case of IP and IEEE 802 networks, see the chapter about IP).

Just like postal addresses, addresses in a computer network can also be used to identify the location of a node, and we are going to consider only this solution in RES 302. While this might seem a natural solution, one should be aware of its consequences.

2.2 Intermediate Nodes and Routing

As we have already observed, there are always multiple nodes in a computer network and the network itself must interconnect all these nodes. One possible, but obviously poor, solution is to use a full mesh, where each node has a direct link with all the other nodes in the network. The problem with this solution is that the number of

links grows as the square of the number of nodes.

[Figure 2.1: Different network topologies. (a) A full mesh. (b) A more realistic network. Circles correspond to end-nodes and squares to intermediate nodes.]

Figure 2.1 shows the full mesh and a more realistic topology, where squares represent intermediate nodes. The role of the intermediate nodes is to route each packet towards its destination. Note that such nodes are not only present at layer 3, where they are called routers or gateways; they can also be present at layer 2, where they are called switches. Sometimes the term layer 3 switch is used to indicate a router, but its usage is fairly rare. Throughout RES 302, whenever we use the term switch we refer to a layer 2 device. Similarly, a router will be a layer 3 device, with the exception of this chapter, where we will use router as an equivalent of intermediate system. Likewise, we are going to use the term packet to represent layer 3 as well as layer 2 messages, which are more appropriately called frames.

The Routing Table

As the role of intermediate systems is to route packets, they need to know where to send each packet so that it will reach its destination. This information is stored in what is typically called (somewhat inappropriately) the routing table, whose name should rather be forwarding table, as it is used to forward packets.¹ We are going to consider a simple example in order to better explain what a routing table is.

¹ This distinction is somewhat beyond the scope of RES 302, but the main idea is as follows: routers do have a routing table as well, which contains all the routing information known by the router. For example, a router might be aware of multiple paths for reaching a certain node but, in this case, it selects only one route and this route is inserted in the forwarding table.

Figure 2.2 shows a simple network with p hosts connected to the intermediate

node $X_1$ and q hosts connected to intermediate system $X_2$. (This topology is called a dumbbell and it is often used in network simulations as it is the simplest topology with a bottleneck link: if there is enough traffic between the nodes on the left (sources) and those on the right (destinations), the link between $X_1$ and $X_2$ can be a bottleneck and experience congestion.)

[Figure 2.2: A simple dumbbell network.]

Whenever $X_1$ receives a packet it must decide on which output interface (labeled $i_0, \ldots, i_n$) to send it. The same is true for $X_2$. For the moment we are assuming that, like in Figure 2.2, there is only one host connected on each interface, therefore it is enough to specify the output interface. In the case of multiple access networks, like Ethernet, this is not always the case. In the context of Figure 2.2, it is therefore enough for $X_1$ and $X_2$ to have a mapping associating an address with the appropriate output interface. For example, the table of $X_1$ might look something like this:

Destination Address    Output Interface
$s_1$                  $i_1$
$s_2$                  $i_3$
$s_4$                  $i_4$
$d_2$                  $i_0$
$d_5$                  $i_0$

Here, for the sake of simplicity, we have used the name of each node as its address. Clearly, one possible solution is to store all the p + q different addresses in the routing tables of $X_1$ and $X_2$.² In this case there are no constraints on the location of each address; for this reason this solution can be described as a flat address space. Another solution is to organize the addresses in such a way that all the addresses on the left of the figure share the same prefix (i.e., the first part of their address), just like in the telephone network. In this case there are constraints on where a certain address can be used.

² For the sake of simplicity we are momentarily assuming that $X_1$ and $X_2$ do not have addresses.

This

solution is often called hierarchical addressing, as there is a hierarchy in the address (a prefix and the rest). Also, all the addresses with the same prefix can be grouped together and represented only by their prefix, something which cannot be done with flat addressing. We will further discuss flat and hierarchical addressing in the next section.

It is worth mentioning that intermediate systems have to search for a match in the routing table for every packet they route. Therefore the size of the table can have a significant impact on the forwarding performance, in terms of the time needed to forward a packet. These days it is common to have links operating at a few Gbit/s. In this case a router can receive more than a million packets per second on each link, and for each one of these packets it needs to find a match in the routing table to determine the output interface.

2.3 Different Ways of Organizing Addresses

In this section we are going to describe several different ways of organizing an address space. In order to better explain some of the ideas, we will present a few examples using IP version 4 (in short IPv4) addresses. An IPv4 address is a 32 bit number which is typically represented in dotted decimal notation: each one of the 4 bytes in the address is represented as a decimal number (between 0 and 255) and the bytes are separated by dots.

Flat Address Space

When there are no constraints on where addresses can be used, the resulting address space can be described as flat, as there are no boundaries and no address is different from another (in the sense that they all have the same properties). One advantage of this solution is that assigning addresses is very simple: whenever a new address is needed, one only needs to make sure that the chosen address is not already in use.
The downside of this solution is that the size of the routing table grows linearly with the number of nodes in the network, as the routing table needs to store an entry for each node. It is therefore impossible to use such a solution in a large network, such as the Internet. At the same time the lack of constraints in the addressing plan can be an important advantage. As we will see shortly, while hierarchical addressing can lead to significantly smaller routing tables, it does introduce strict constraints on how addresses can be used in the network.

Ethernet networks do use a flat address space. As you have seen in RES 301, the uniqueness of each address is guaranteed by the network interface manufacturer. Learning bridges, which are the Ethernet intermediate systems, exploit the fact that Ethernet networks have a limited size by using flooding to forward packets for which there is no match in the switching table.³ Flooding means forwarding a packet on all the output interfaces, except the one on which it was received.

³ In the case of layer 2 devices, like bridges and switches, the routing table is actually called the switching table.

As the name implies, a packet can end up being forwarded to every node in the network; at the same time this guarantees

that the packet will be delivered to the destination! The combination of flooding and of the learning algorithm keeps the size of the switching table manageable. Basically it grows linearly with the number of end-nodes which are actually sending data at any given time, as opposed to the total number of end-nodes in the network. These are just a few of the main ideas behind learning bridges, see RES 301 for more details.

Hierarchical Addressing

If we used flat addressing in the network depicted in Figure 2.2, the routing table of $X_1$ would have the q entries corresponding to all the machines on the right, yet for all these entries the output interface is the same, as the route is the same. If we assign addresses in such a way that all the addresses of the nodes $d_1, d_2, \ldots, d_q$ share the same prefix, we can reduce the size of the routing table by replacing the q different entries with only one, with the prefix in the address column and $i_0$ in the output interface column.

In packet switched networks, like IP and IEEE 802 networks, every packet needs to include, at the very least, the destination address, as every packet is treated independently and must contain all the information needed to route it to the destination. Typically the source address is included as well, so that the receiver knows who has sent the packet and, in the case of errors, the intermediate nodes (and the receiver) can notify the source. Each router handling the packet must read the destination address and use it to find a match in the routing table. In order to implement this operation efficiently, the address length is typically fixed, as it would be too time consuming to parse a packet with a variable address length. Therefore it is reasonable to assume that, in a given network, all the addresses have the same fixed length.
While it is possible, at least in theory, to use names as addresses at layers 2 and 3, numbers are used instead, as computers (including intermediate nodes) can handle them faster and more efficiently than names. In the remainder of the chapter we are going to assume that addresses are binary numbers with $n$ bits. In the case of IPv4 $n = 32$, for IPv6 $n = 128$ and for IEEE 802 networks $n = 48$. Just like in telephone numbers, the prefix corresponds to the most significant digits. A prefix of length $k$ ($0 \le k < n$) is the set of the $2^{n-k}$ addresses whose first $k$ bits have the same value. Note that $k = 0$ is a special case corresponding to the whole address space; this is useful in the longest prefix match algorithm, as discussed below. We use Greek letters to denote prefixes: $\alpha_k$ is a prefix of length $k$. As there are $2^k$ different prefixes of length $k$, we use the notation $\alpha_{k,j}$, with $j \in [1, 2^k]$, to indicate a specific prefix.

Fixed Prefix Length

Recall that the idea behind hierarchical addressing and prefixes is that the intermediate nodes know only about prefixes and not about individual addresses, leading to smaller routing tables. For the sake of simplicity, in Figure 2.2 each end-node has its own link to the closest intermediate system. If we make the assumption that, like in IP and IEEE 802 networks, each interface must have its own address, we would need $2p$ different prefixes for the $s_i$ nodes (and the corresponding interface at $X_1$) and $2q$ prefixes for

the $d_j$ nodes. Thus we need a prefix of at least $h_{\min} = \lceil \log_2 2(p + q + 1) \rceil$ bits, where $\lceil x \rceil$ is the smallest integer greater than or equal to $x$. We need $p + q + 1$ prefixes because we also need two addresses for the $i_0$ interfaces of $X_1$ and $X_2$. Any longer prefix would work as well, i.e., any prefix length $k$ such that $h_{\min} \le k \le n - 2$ would work. As we need at least two addresses in each prefix (one for each interface), the longest possible prefix length is $n - 2$. If we choose $k = h_{\min}$ we have larger prefixes, so that we could handle, for example, multiple access networks, where it is possible to have multiple nodes connected to the same network and not just two as in Figure 2.2. In other words, there is a trade-off between the length of the prefix ($k$) and the number of addresses in it. Note that we use the term longer when talking about the length of a prefix, while we use the term larger to refer to its size, i.e., the number of addresses belonging to the prefix. The longer a prefix is, the smaller its size.

One possible solution is to use a variable prefix length, in order to better fit the characteristics of each network. At the same time a variable prefix length increases the complexity at each intermediate node: whenever it routes a packet it needs to determine the prefix of its destination address, so that it can search for it in the routing table. Furthermore, the intermediate node might need to be capable of determining the length of the prefix of each address (this is not needed in the case of the longest prefix match, discussed below). From the point of view of optimizing the routing of packets, it is best to use a fixed prefix length. But, as we have observed above, this introduces the problem of choosing the appropriate prefix length. Until the mid 1990s, the Internet adopted a compromise using three different prefix lengths: 1, 2 and 3 bytes.
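The trade-off between the length of a prefix and its size can be made concrete with a few lines of Python. This is a minimal sketch that only applies the definition of a prefix given above ($2^k$ prefixes of length $k$, each with $2^{n-k}$ addresses); the function name `prefix_stats` is an assumption made for this illustration, and the raw counts deliberately ignore any bits reserved to identify address classes.

```python
def prefix_stats(n, k):
    """Return (number of prefixes of length k, addresses in each prefix)
    for addresses of n bits."""
    if not 0 <= k < n:
        raise ValueError("the prefix length must satisfy 0 <= k < n")
    return 2 ** k, 2 ** (n - k)


# The three prefix lengths of 1, 2 and 3 bytes, with IPv4 addresses (n = 32).
for k in (8, 16, 24):
    count, size = prefix_stats(32, k)
    print(f"k = {k:2d}: {count:>10,} prefixes of {size:>10,} addresses each")
```

The longer the prefix, the more prefixes there are and the fewer addresses each of them contains.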
If a network uses only one prefix length, there is no need to specify anything other than the prefix length $k$: the address space is divided in $2^k$ groups, each one with $2^{n-k}$ addresses. If we have multiple prefix lengths, like in the Internet, we also need to specify how each prefix length is allocated in the address space. Figure 2.3 shows how the IPv4 address space was divided before 1994. Half of the address space (all the addresses starting with a 0) is allocated to the prefixes of length eight. As the most significant bit is set to 0, there are $2^7 = 128$ different prefixes of length eight, and each one has $2^{24}$ addresses in it. Half of the remaining address space (a quarter of the total) is allocated to the prefixes of length 16: there are $2^{14} = 16384$ such prefixes, each with $2^{16} = 65536$ addresses (the first two bits of each prefix in this group are 10, hence there are only $2^{14}$ such prefixes). The remaining quarter of the address space is divided between the prefixes of length 24, the multicast addresses (see section 2.5) and addresses reserved for future use. The prefixes of length 24 take half of this remaining space, corresponding to an eighth of the total. There are $2^{21}$ such prefixes ($21 = 24 - 3$, as the first three bits are set to 110), each with $2^8 = 256$ addresses. Each of these address blocks is also called a class, with class A being the class of the prefixes of length eight, class B the one for length 16 and so on (as shown in the figure).

By definition the Internet is a network of networks: an inter-network. Each prefix was meant to be allocated to a different network, hence the bits corresponding to the prefix in the address are often called the network part of the address, while the remaining bits, meant to distinguish each host in a network, are called the host part, as shown in Figure 2.3. There are two constraints on how prefixes can be attributed to networks:

first, all the hosts in the same network must be able to exchange packets directly at layer 2; second, all the interfaces of a router must belong to different networks. This implies that routers are only involved in exchanges between hosts on different networks. This is not surprising, given that the role of a router is exactly to connect different networks. Note that this implies that even point-to-point links need their own prefix, even though only two addresses are used.

[Figure 2.3: The IPv4 address space before 1994. Each 32-bit address starts with the bits identifying its class:
Class    Leading bits    Rest of the address
A        0               network, host
B        10              network, host
C        110             network, host
D        1110            multicast group ID
E        1111            reserved for future use]

It is worth underlining the fact that the prefixes (i.e., network addresses) used in IPv4 before 1994 do not overlap, that is, the intersection between two different prefixes is always empty. In other words, the whole address space was divided in a given number of non-overlapping blocks of different sizes. Furthermore, by looking at the first byte, it was possible to determine the length of the prefix and, therefore, to which block each address belonged. Whenever a router received a packet it would examine the first byte of the destination address, determine the prefix length, extract the prefix and search for it in its routing table. It is easy to compute the worst case size of a routing table: the worst case is when all the prefixes have been allocated, corresponding to a total of $2^7 + 2^{14} + 2^{21} = 2113664$ different entries in each routing table. Clearly this is a worst case and it is still a significant improvement over a flat address space, as the latter would have required $2^{32}$ entries in each routing table! Yet it is far from trivial given that, as we have already mentioned, routers have to search for a match in this table every time they route a packet.
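The way a router could derive the prefix from the first byte of a destination address, together with the worst case table size computed above, can be sketched as follows. The function name `classful_prefix` and the sample addresses are assumptions made for this illustration; the class boundaries follow from the leading bits 0, 10, 110, 1110 and 1111 described in the text.

```python
def classful_prefix(address):
    """Return (class, prefix length in bits, prefix bytes) for an IPv4
    address in dotted decimal notation, using the pre-1994 classes."""
    octets = [int(part) for part in address.split(".")]
    if len(octets) != 4 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"not a valid IPv4 address: {address!r}")
    first = octets[0]
    if first < 128:          # leading bit 0
        return "A", 8, tuple(octets[:1])
    if first < 192:          # leading bits 10
        return "B", 16, tuple(octets[:2])
    if first < 224:          # leading bits 110
        return "C", 24, tuple(octets[:3])
    if first < 240:          # leading bits 1110: multicast, no network part
        return "D", None, None
    return "E", None, None   # leading bits 1111: reserved


# Worst case: every class A, B and C prefix has an entry in the table.
WORST_CASE_ENTRIES = 2 ** 7 + 2 ** 14 + 2 ** 21

print(classful_prefix("131.33.4.5"))   # a class B address
print(WORST_CASE_ENTRIES)
```

The first byte alone is enough to decide how many bytes of the address form the prefix, which is why no prefix length had to be stored in the routing table.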
One could even say that using a fixed prefix length is basically equivalent to a flat address space using shorter addresses: as we have just seen, in the worst case, routers do need to store a route for each prefix. In some cases, it is possible to reduce the size of the routing table by using a special entry called the default route: whenever the destination address of a packet does not match any of the entries in the routing table, the packet is forwarded along the default route. In IP networks a dedicated address is used to indicate the default route, as this address is

reserved and cannot be assigned to an interface.

[Figure 2.4: A simple network where prefix aggregation is possible.]

As an example, consider the network shown in Figure 2.4, where the number after the slash is the length of the prefix. This notation is often used even though, in the case of the Internet before 1994, it is not needed, as the length of the prefix is uniquely determined by the value of the first byte of the address. Note that this network is obviously not a realistic representation of the Internet, except at the very beginning, when it was an experimental network connecting a few institutions in the US. Table 2.1 shows the routing table of $r_2$; it is easy to notice that several prefixes correspond to the same output interface in the table. One could even say that, as far as $r_2$ is concerned, all the prefixes routed through the same interface are basically equivalent and could potentially be represented by only one entry in the routing table. We will see that with a variable prefix length we can indeed merge multiple prefixes and reduce the size of the routing tables. Note that all the routing tables in the network of Figure 2.4 have the same size (11), as each one of them must contain all the prefixes present in the network.⁴ Finally, note that, in Figure 2.4, there are dotted clouds even on the point-to-point links between routers. This is to underline the fact that, in IP networks, each link must have its own prefix, even though only two addresses are used. For the sake of simplicity these prefixes are not shown in the figure, but one should not forget their existence.
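The lookup with a fixed (class-based) prefix length and a default route, as described above, can be sketched as follows. The table contents and interface names are hypothetical and are not those of Table 2.1; the key `DEFAULT` is just a marker chosen for this sketch and does not represent the reserved address used in real IP networks.

```python
DEFAULT = "default"  # marker for the default route (assumption of this sketch)


def lookup(table, address):
    """Return the output interface for `address` (dotted decimal notation).

    The prefix length is derived from the first byte, as before 1994; if the
    prefix is not in the table, the default route (if any) is used.
    """
    octets = tuple(int(part) for part in address.split("."))
    first = octets[0]
    if first >= 224:
        raise ValueError("multicast and reserved addresses are not handled")
    prefix_bytes = 1 if first < 128 else 2 if first < 192 else 3
    return table.get(octets[:prefix_bytes], table.get(DEFAULT))


# Hypothetical routing table: two explicit prefixes and a default route.
table = {(131, 33): "i1", (57,): "i2", DEFAULT: "i3"}

print(lookup(table, "131.33.7.1"))   # matches 131.33/16
print(lookup(table, "195.10.10.1"))  # no match: forwarded on the default route
```

Without the `DEFAULT` entry the second lookup would return `None`, meaning that the router has no route for that destination.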
⁴ In this case the default route cannot offer any significant reduction in the size of the routing tables, therefore we will ignore this option for the time being.

The solution based on three different fixed prefix lengths was a reasonable compromise when the Internet started as a research project in the mid 1970s but, as it grew into a worldwide network, its limitations became an obstacle to its growth. One of the main problems was the uneven distribution of the address block (prefix) sizes. Half of the address space was allocated to a few very large networks (class A) and one eighth of the

address space was allocated to a lot of very small networks (class C). The result was that class B addresses were the most widely used and there was very little demand for class C addresses. This is because, with the widespread adoption of local area networks (mainly Ethernet), most layer 2 networks had at most a few thousand machines and most of them had more than 256 machines. Another important problem, again a consequence of the fixed-size address blocks, was that the size of the routing table kept growing linearly with the number of prefixes being allocated. Note that, if the number of allocated prefixes grows exponentially, so does the size of the routing table.

[Table 2.1: The routing table of $r_2$ in the network shown in Figure 2.4, in the case of fixed prefix length. Columns: Destination Prefix, Output Interface.]

Variable Prefix Length

One possible solution to this problem is to aggregate multiple networks (prefixes) into larger ones, so that multiple entries in the routing tables can be replaced by only one. In 1994 the Internet switched to Classless Interdomain Routing (CIDR), removing the three fixed prefix lengths and using a variable prefix length instead [31, 12]. A variable prefix length means that the address space is no longer divided in blocks of a fixed size; instead it can be divided in an arbitrary number of blocks of different sizes, with certain blocks strictly contained in other ones.

Even in a network as simple as the one shown in Figure 2.4 it is possible to aggregate several prefixes and use a single (shorter) prefix to represent them. Consider, for example, the four networks connected to $r_1$: as they are only connected to $r_1$, and as they obviously share a common (shorter) prefix, namely 131/8, they can be advertised to the rest of the network using this shorter prefix.
It is important to note that two conditions must be satisfied for the aggregation to be possible: first, the networks must be close to each other in the network, second, they must share a common prefix. The three networks

connected to $r_6$ and the two connected to $r_4$ meet only the first condition. The alert reader might have already noticed that, for the four networks connected to $r_1$, it is actually possible to use a prefix of length 11, given that the four (longer) prefixes of length 16 have exactly the same values in the first 11 bits. This shows how it is always possible to use a shorter prefix: obviously, if a set of addresses have the same values for the first $k$ bits, they also have the same values for the first $i$ bits, for any $i \le k$. Note that, typically, when writing prefixes in decimal notation, the part after the prefix is set to 0, but this is not always the case and can lead to confusion: there are many different ways of writing exactly the same prefix of length 11, as only the first 11 bits matter and the bits after the eleventh one can take any value.

By observing all the networks connected to $r_1$ and $r_2$, we can see that they all share the 131/8 prefix. Note that this is the longest prefix they share, as the networks connected to $r_2$ have values larger than 128 in the second byte, while those connected to $r_1$ have values smaller than 128 in the second byte. It is therefore possible to use the 131/8 prefix to characterize all the networks connected to $r_1$ and $r_2$. More precisely, using a variable prefix length like in CIDR, it is possible to replace six entries (the six prefixes of length 16 starting with 131, such as 131.33/16) with just one (131/8) in the routing tables of $r_3$, $r_4$, $r_5$ and $r_6$. One can also reduce the number of entries in the routing table of $r_2$, by replacing the four entries for the networks connected to $r_1$ with only one, using the prefix of length 11. In the network shown in Figure 2.4, there are other networks whose prefixes could be aggregated, given that these networks are close to each other: the two connected to $r_4$ and the three connected to $r_6$.
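The search for the longest prefix shared by a set of networks can be sketched with Python's standard `ipaddress` module. Only 131.33/16 comes from the text; the other prefixes below are hypothetical stand-ins chosen so that the second byte is smaller than 128 for the networks of $r_1$ and larger than 128 for those of $r_2$. The function name `common_prefix` is likewise an assumption of this sketch.

```python
import ipaddress


def common_prefix(networks):
    """Return the longest IPv4 prefix containing all the given networks."""
    nets = [ipaddress.ip_network(n) for n in networks]
    length = min(net.prefixlen for net in nets)
    first = int(nets[0].network_address)
    for net in nets[1:]:
        differing = first ^ int(net.network_address)
        # Number of leading bits on which the two addresses agree.
        length = min(length, 32 - differing.bit_length())
    mask = (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF
    return ipaddress.ip_network((first & mask, length))


# Hypothetical prefixes for the networks connected to r1 and to r2.
r1_networks = ["131.33.0.0/16", "131.40.0.0/16",
               "131.50.0.0/16", "131.60.0.0/16"]
r2_networks = ["131.200.0.0/16", "131.210.0.0/16"]

print(common_prefix(r1_networks))                # 131.32.0.0/11
print(common_prefix(r1_networks + r2_networks))  # 131.0.0.0/8

# Different ways of writing the same prefix: the bits after the 11th are ignored.
same = ipaddress.ip_network("131.33.0.0/11", strict=False)
print(same == ipaddress.ip_network("131.32.0.0/11"))  # True
```

With these sample values the four networks of $r_1$ aggregate into a prefix of length 11, while adding the networks of $r_2$ shortens the common prefix to 131/8, in line with the discussion above.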
This is not possible, though, as the prefixes allocated to these networks are not contiguous and do not have a common prefix, other than the prefix of length 1 shared by 195 and 151, which is not very useful: it is too short and it is also shared by the networks connected to r1 and r2. Changing the prefixes allocated to these networks is the only solution, if we want to be able to aggregate them. This operation is also called renumbering, as it changes the numerical addresses of all the machines involved. It can be a fairly complicated operation given that, at the very least, it is necessary to reconfigure all the routers involved and some other devices like configuration servers (see the chapter about IP). Yet, in certain cases, the benefits outweigh the costs and such operations do take place in the Internet.

As we will see in section 2.4.1, IP prefixes are now allocated on a regional basis, with each region corresponding, roughly, to a continent. All the prefixes allocated to a region are administered by the regional registry, whose role is to assign sub-prefixes to large Internet Service Providers (ISPs), which, in turn, allocate sub-sub-prefixes to their customers, which can be other (smaller) ISPs and/or edge networks. Note that we have used the terms sub-prefix and sub-sub-prefix only to emphasize the fact that, once a certain prefix has been assigned to an entity, this entity can further divide the prefix it has received in any way it wants and allocate parts of it to its customers. In the remainder of this chapter we will always use the term prefix, with the understanding that each prefix can be part of a shorter prefix and, likewise, can contain several longer prefixes.
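The aggregation condition discussed above (a set of prefixes can be replaced by a single shorter one only if they agree on their leading bits) can be checked mechanically. The following is a minimal sketch using Python's standard `ipaddress` module; the function name `common_prefix` and the `10.x` addresses are illustrative assumptions and are not taken from Figure 2.4.

```python
import ipaddress

def common_prefix(networks):
    """Return the longest IPv4 prefix shared by all the given networks."""
    nets = [ipaddress.ip_network(n) for n in networks]
    # The shared prefix cannot be longer than the shortest prefix in the set.
    length = min(n.prefixlen for n in nets)
    first = int(nets[0].network_address)
    while length > 0:
        mask = (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF
        if all(int(n.network_address) & mask == first & mask for n in nets):
            break
        length -= 1  # try a shorter prefix
    mask = (0xFFFFFFFF << (32 - length)) & 0xFFFFFFFF if length else 0
    return ipaddress.ip_network((first & mask, length))

# Hypothetical /16 networks whose second byte is 32, 33, 40 and 63:
# they agree on the first 11 bits, so they aggregate into a /11.
print(common_prefix(["10.32.0.0/16", "10.33.0.0/16",
                     "10.40.0.0/16", "10.63.0.0/16"]))   # 10.32.0.0/11

# Prefixes starting with 195 and 151 only share their first bit.
print(common_prefix(["195.1.0.0/16", "151.2.0.0/16"]))   # 128.0.0.0/1

# Different decimal spellings of the same /11 prefix compare equal
# once the bits after the eleventh one are ignored.
print(ipaddress.ip_network("10.45.0.0/11", strict=False)
      == ipaddress.ip_network("10.32.0.0/11"))           # True
```

Note that the sketch only tests the second condition (a common prefix); whether the networks are also close to each other depends on the topology and cannot be read from the addresses.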

Figure 2.5: A simple network with two ISPs: the one on the left of the dotted line uses the prefix $\alpha_{k_1,1}$, the one on the right of the line uses the $\beta_{k_2,2}$ prefix. In the figure, networks n1 to n8 are labeled $\alpha_{k_1+4,1}$ to $\alpha_{k_1+4,8}$ and n13 is labeled $\alpha_{k_1+4,9}$; networks n9 to n12 are labeled $\beta_{k_2+3,1}$ to $\beta_{k_2+3,4}$, n14 is labeled $\beta_{k_2+3,5}$ and n15 is labeled $\beta_{k_2+3,6}$.

Figure 2.5 shows a network similar to the one in Figure 2.4, the main difference being that a few networks and routers have been added. Suppose that there are two different ISPs in the network: one on the left of the dotted line (A) and one on the right (B). The prefix $\alpha_{k_1,1}$ (i.e., the first prefix of length $k_1$) has been allocated to the operator on the left and the prefix $\beta_{k_2,2}$ has been allocated to the operator on the right. The fact that the second operator uses the second prefix among all those of length $k_2$ guarantees that, even if $k_1 = k_2$, the two prefixes used by the ISPs are indeed different. Each network has been labeled $n_i$ with $i = 1, 2, \ldots, 15$, so that we can refer to a specific network using this name.

ISP A has 9 networks, while ISP B has 6; therefore A needs to allocate at least 4 bits to identify each network, while B needs only 3. In other words, A needs to allocate sub-prefixes of length at least $k_1 + 4$ to each one of its 9 networks, as $\lceil \log_2 9 \rceil = 4$, and B needs to allocate sub-prefixes of length at least $k_2 + 3$ to each one of its 6 networks, as $\lceil \log_2 6 \rceil = 3$. If they adopt this solution, each one of A's networks can have up to $2^{n-(k_1+4)}$ machines and each one of B's networks can have up to $2^{n-(k_2+3)}$ machines. Recall that $n$ is the total address length, and that the notation $\alpha_{k_1+4,i}$ corresponds to the $i$-th (sub)prefix of length $k_1 + 4$.
If, for some reason, any of these networks has more machines than these limits allow, the obvious solution is to split it into two or more smaller networks. Another solution could be to merge some smaller networks, provided they exist, in order to reduce the total number of networks so that we could use a shorter sub-prefix, but this solution works only if we can reduce the total number of networks to a smaller power of 2.
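The arithmetic behind this sizing can be sketched in a few lines of Python. The function name `plan` and the prefix length of 16 used in the example are assumptions made only for illustration; the notes leave $k_1$ and $k_2$ symbolic.

```python
def plan(prefix_len, n_networks, addr_len=32):
    """Size the sub-prefixes needed to number n_networks inside one prefix."""
    # Smallest number of bits b such that 2**b >= n_networks,
    # i.e. the ceiling of log2(n_networks), computed with integers only.
    extra_bits = (n_networks - 1).bit_length()
    sub_len = prefix_len + extra_bits
    return {
        "extra_bits": extra_bits,
        "sub_prefix_len": sub_len,
        "spare_networks": 2 ** extra_bits - n_networks,
        "addresses_per_network": 2 ** (addr_len - sub_len),
    }

# ISP A: 9 networks; ISP B: 6 networks (k1 = k2 = 16 is a hypothetical choice).
print(plan(16, 9))
# {'extra_bits': 4, 'sub_prefix_len': 20, 'spare_networks': 7, 'addresses_per_network': 4096}
print(plan(16, 6))
# {'extra_bits': 3, 'sub_prefix_len': 19, 'spare_networks': 2, 'addresses_per_network': 8192}
```

The `addresses_per_network` value counts all the addresses in each sub-prefix; in the case of IP, some of them are reserved and cannot be assigned to machines.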

Clearly there is a compromise between the number of networks (i.e., sub-prefixes) and the number of machines on each network. With the solution proposed above, A can add up to $2^4 - 9 = 16 - 9 = 7$ networks without changing the prefix lengths. Similarly, B can add up to $2^3 - 6 = 8 - 6 = 2$ networks. If A or B have reason to believe that the number of networks will grow more, they can use a longer sub-prefix for each network. The price they have to pay in this case is that there are fewer addresses in each sub-prefix.

Longest Prefix Match

When using flat addressing or a fixed prefix length, each destination address will match, at most, one entry in the routing table. In the case of flat addressing there are complete addresses in the left column of the routing table. In the case of a fixed prefix length there are prefixes but, even in the case of multiple (fixed) prefix lengths, like in the Internet before 1994, each address belongs to exactly one prefix, therefore there is never more than one matching entry. In the case of variable prefix length, instead, it is possible to have multiple matches. It is therefore essential to specify which entry the router should use in the case of multiple matches. One possible solution, and the one used in the Internet since 1994, is to use the longest prefix match: whenever an address matches multiple prefixes the router should use the longest one (footnote 5). Note that it is impossible for an address to match two different prefixes of the same length, therefore the longest matching prefix is always unique.

Thanks to this rule it is possible to specify routes to specific sub-prefix(es). In the network shown in Figure 2.5, there are three links crossing the boundary between the two ISPs. If each router could use only the whole prefixes $\alpha_{k_1,1}$ and $\beta_{k_2,2}$, all the traffic from that router to the other ISP would use the same inter-ISP link. It is clearly preferable to be able to send part of the traffic on each link.
One solution is to use all the longer sub-prefixes (one for each network), but clearly this is just as undesirable, given that it is exactly the solution used with the fixed prefix length. A better solution, instead, is to exploit the longest prefix match algorithm. Consider r1: it seems appropriate that it should send the traffic addressed to the networks n14 and n15 through r9 and r10, and possibly the traffic for n11 as well. This can be easily accomplished by using just three entries in the routing table of r1: one with $\beta_{k_2,2}$ and i1, and two with $\beta_{k_2+3,6}$, $\beta_{k_2+3,5}$ and i3. This way it is possible to cover the whole $\beta_{k_2,2}$ prefix with just three entries, instead of 6 as in the case of fixed prefix length. On such a small network the gain might seem negligible, but over a network as large as the Internet this can significantly reduce the size of the routing tables.

At r4, instead, we might prefer to route all the traffic for $\beta_{k_2,2}$ towards r6 and only the traffic for $\beta_{k_2+3,1}$ towards r5. Again this can be accomplished with only two entries, one with $\beta_{k_2,2}$ and i3 and one with $\beta_{k_2+3,1}$ and i2.

Footnote 5: The prefix of length 0 is used to indicate the default route as it matches any address.

We have previously mentioned that at r1 we might be interested in routing all the traffic for n15, n14 and n11 towards r10. This can be accomplished, as mentioned above, using one extra entry for each sub-prefix but, as the number of networks in our target group grows, so would the number of entries we need to add to the routing table. A better solution is to use a common sub-prefix for these networks. For example, we could allocate a sub-prefix of length $k_2 + 3$ to the three networks and then a sub-sub-prefix of length $k_2 + 5$ to each one of the networks. Clearly the sub-sub-prefixes of length $k_2 + 5$ must belong to the specific sub-prefix of length $k_2 + 3$ that we had decided to use for these networks.

Each ISP, or whatever institution or company is running the network, can decide to partition the assigned prefix in any way it wants. The only constraint is that the size of each group of addresses has to be a power of 2: no matter what the prefix length is, the number of addresses with the same values in the first k bits is always a power of 2.

Note that it is also possible for each router to see different prefixes. For example, in the network shown in Figure 2.5, r3 would use only the two shorter prefixes, $\alpha_{k_1,1}$ associated with the interface i2 and $\beta_{k_2,2}$ associated with i1, while, as mentioned above, r1 and r4 can use longer prefixes as well.

Recall that, throughout this chapter, we are assuming that the address size is fixed and equal to $n$. As an organization cannot use on its network addresses which are outside the prefix it has been allocated, it can use only $n - k$ bits to handle any further subdivision (footnote 6). In the case of IP there is also the extra constraint that, in each address block (network), it is not possible to use the addresses with all ones or all zeros in the host part. Clearly the longer the address is, the larger the address space and the easier it is to efficiently allocate different prefixes, sub-prefixes, sub-sub-prefixes and so on.
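The longest prefix match rule described above can be sketched as follows. This is a deliberately naive linear scan, not the data structure a real router would use; the table contents are hypothetical and only mimic the r1 example (the whole prefix of ISP B on i1, its fifth and sixth /11 sub-prefixes on i3), assuming for illustration that ISP B holds `20.0.0.0/8`.

```python
import ipaddress

def longest_prefix_match(table, destination):
    """Return the interface of the longest prefix matching the destination.

    table is a list of (prefix, interface) pairs; None means no match.
    """
    dest = ipaddress.ip_address(destination)
    best = None
    for prefix, interface in table:
        net = ipaddress.ip_network(prefix)
        # Keep the entry only if it matches and is longer than the best so far.
        if dest in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, interface)
    return None if best is None else best[1]

# Hypothetical routing table in the spirit of r1.
table_r1 = [
    ("0.0.0.0/0", "i0"),       # default route: prefix of length 0
    ("20.0.0.0/8", "i1"),      # the whole prefix of the other ISP
    ("20.128.0.0/11", "i3"),   # fifth /11 sub-prefix
    ("20.160.0.0/11", "i3"),   # sixth /11 sub-prefix
]

print(longest_prefix_match(table_r1, "20.130.1.1"))  # i3 (matches /0, /8 and /11)
print(longest_prefix_match(table_r1, "20.10.0.1"))   # i1 (matches /0 and /8)
print(longest_prefix_match(table_r1, "30.0.0.1"))    # i0 (only the default route)
```

Three entries are enough to cover the whole `/8`: the two more specific entries override the general one only for the addresses they contain.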
One of the advantages of IPv6 is that it uses addresses with 128 bits, resulting in an extremely large address space that allows a much greater flexibility in allocating prefixes than in IPv4, with potential benefits for the size of the routing tables used in the Internet.

2.4 Routing Table Consistency

In the previous section, we have looked at a few specific examples of routing tables, considering one router at a time. Routers do forward packets independently, as each router consults only its local routing table whenever it has to forward a packet. It is nonetheless important to underline the fact that, while each router operates independently, the routing tables have to be consistent with each other. Consider the following example using the network shown in Figure 2.5: whenever r2 receives a packet for n15, it forwards it to r6; whenever r6 receives a packet for n15, it forwards it to r4; whenever r4 receives a packet for n15, it forwards it to r2, forming a routing loop. In this case, a packet that enters the loop will never leave it (as long as the routing tables are not changed). In the case of IP, the packet will eventually be discarded when the Time To Live (TTL) field reaches zero. This field is initialized to a finite positive value by the sender and decremented by one by each router forwarding the packet.

Footnote 6: Certain (larger) organizations, and ISPs in particular, often receive multiple prefixes and can administer each one of them as they prefer.
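The effect of the TTL on the routing loop described above can be simulated with a small sketch. The dictionary `loop` encodes the three inconsistent forwarding decisions of the example; the `fixed` table is a hypothetical repair, invented for illustration, and does not describe the actual topology of Figure 2.5.

```python
def trace(next_hop, first_router, destination, ttl=64):
    """Follow a packet hop by hop; return the routers visited and the outcome."""
    router, visited = first_router, []
    while True:
        visited.append(router)
        ttl -= 1                      # every router decrements the TTL by one
        if ttl == 0:
            return visited, "discarded"
        hop = next_hop[router][destination]
        if hop == destination:        # the destination network is attached here
            return visited, "delivered"
        router = hop

# Inconsistent tables: r2 -> r6 -> r4 -> r2 for traffic addressed to n15.
loop = {"r2": {"n15": "r6"}, "r6": {"n15": "r4"}, "r4": {"n15": "r2"}}
print(trace(loop, "r2", "n15", ttl=8))
# (['r2', 'r6', 'r4', 'r2', 'r6', 'r4', 'r2', 'r6'], 'discarded')

# Hypothetical consistent tables: r6 now delivers the packet itself.
fixed = {"r2": {"n15": "r6"}, "r6": {"n15": "n15"}}
print(trace(fixed, "r2", "n15", ttl=8))
# (['r2', 'r6'], 'delivered')
```

Without the TTL check the first call would never return, which is exactly the behavior the field is meant to prevent.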

In the case of Ethernet, instead, the packet would indeed loop forever, as Ethernet does not have a TTL field or any equivalent mechanism. For this reason Ethernet networks use the spanning tree algorithm to make sure that packets are always forwarded along a tree: instead of using all the links in the network, the spanning tree algorithm selects a subset of the links that form a tree covering (spanning) all the nodes, and only these links are used to carry traffic. In this case packets will never loop as long as each intermediate node does not forward a packet on the same interface where it was received. This condition can be enforced locally by each node and does not require any global knowledge.

In most IP networks, routers use a routing protocol (see the chapter about IP) to share routing information and to choose loop-free routes. In the somewhat rare cases where routing tables are configured by hand, great care must be taken to make sure that there are no loops in the routes and that the packets will indeed follow the desired route. Even for small networks this quickly becomes intractable, as the number of routes to consider grows with the square of the number of prefixes (networks): for example, in the network of Figure 2.5 there are 15 (sub)networks for a total of 225 source-destination pairs.

2.4.1 IP Prefix Allocation

The situation shown in Figure 2.4, with neighboring networks with non-contiguous prefixes, was not uncommon in the Internet, where, until 1994, prefixes were often allocated to entities like universities, research centers and companies that could be described as edge networks (or leaf networks), as they are typically at the edge of the Internet and do not usually interconnect other networks.
Furthermore, no geographical consideration played a role in how prefixes were assigned: for example, the 140.8/16 prefix could have been allocated to an organization in Germany while 140.7/16 and 140.9/16 could have been allocated to organizations in a different country or even on a different continent. While geographical proximity does not necessarily imply network proximity, it is extremely unlikely that networks on different continents will be next to each other network-wise.

Since 1994, prefixes are no longer allocated directly to edge networks. Instead, the IPv4 address space has been divided into several fairly large blocks (typically /8), which have then been assigned to five regional registries:

AFRINIC (African Network Information Center)
APNIC (Asia Pacific Network Information Centre)
ARIN (American Registry for Internet Numbers) (North America)
LACNIC (Latin America and Caribbean Network Information Centre)
RIPE NCC (Réseaux IP Européens Network Coordination Centre) (Europe, Middle East, Central Asia)

The IANA website has a list (footnote 7) of all the prefixes allocated to the different regional registries (and a few large corporations/institutions that had acquired these prefixes before 1994).

Footnote 7: Available at

Figure 2.6: An example where all the customers of ISP D (i.e., n5 and n6) have to change their addresses if ISP D changes its own ISP from B to A.

The regional registries assign the addresses they have received from IANA to large ISPs and then each ISP allocates them to its own customers, which can be other (usually smaller) ISPs, and so on until the final customers. The number of organizations which allocate prefixes between the regional registry and the final customer is not fixed and, in some cases, an institution or company can directly use the prefix it has received from the registry, without any further delegation.

In the current scheme, as opposed to what happened before 1994, the ISPs control the addresses and a leaf network typically has to change its addresses if it changes ISP. This problem is compounded in the case where an ISP, which has not received its addresses directly from the regional registry, changes its own ISP, forcing its customers to change addresses as well. Figure 2.6 shows such an example: if ISP D changes its own ISP from B to A, its customers n5 and n6 have to change all their addresses as well, given that now D will use one of the prefixes allocated to A instead of one of the prefixes allocated to B.

2.5 Different Message and Address Types

While in RES 302, and throughout this chapter, we consider only unicast addresses and messages (with a few words about the broadcast case), it is useful to briefly mention the definition of the different address (and message) types:

Unicast: the message is addressed to a single specific node.

Broadcast: the message is addressed to all the nodes in the network (the sender being on the network in question). In the case of IP we talk mainly about limited broadcast, where the message is addressed to all the nodes in a subnet (prefix). In the case of directed broadcast it is possible, at least in theory, for the sender to be on a different network.
Multicast: the message is addressed to a subset of nodes. Typically nodes need to register their interest in receiving the messages addressed to a given multicast address. An example is a video-conference with more than two participants (the video-conference corresponds to a unique multicast address and all the participants need to explicitly register in order to receive the packets belonging to the video-conference). Another example is the distribution of a video flow, like a TV channel. Note that the network operator needs to deploy the infrastructure needed to support multicast.

Anycast: the message is addressed to any one node belonging to a specific group. An anycast message is supposed to be delivered to only one node, typically the one closest to the sender among all those belonging to the group. An example is DNS: when a host needs to resolve a name it could send the request to the anycast address corresponding to the DNS service, instead of sending the request as a unicast message to a specific DNS server. This way there is no need to configure the IP address of a DNS server in each host; furthermore, if a new DNS server is deployed closer to the host there is no need to change the configuration of the host. Just like in the multicast case, the network operator needs to deploy the infrastructure needed to support anycast services. The interested reader is referred to RFC 4786 [1] for a discussion on the operation of anycast services.
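For IPv4, some of these address types can be recognized from the address alone. The sketch below uses Python's standard `ipaddress` module; the function name `classify` and the example subnet `192.0.2.0/24` are assumptions made for illustration. It treats the all-ones address 255.255.255.255 as the limited broadcast address and the address with all ones in the host part of a given subnet as a directed broadcast address. Anycast addresses cannot be told apart this way, since in IPv4 they are taken from the ordinary unicast space.

```python
import ipaddress

def classify(destination, subnet):
    """Classify an IPv4 destination address relative to a given subnet."""
    dest = ipaddress.ip_address(destination)
    net = ipaddress.ip_network(subnet)
    if dest == ipaddress.ip_address("255.255.255.255"):
        return "limited broadcast"
    if dest == net.broadcast_address:   # all ones in the host part
        return "directed broadcast"
    if dest.is_multicast:               # 224.0.0.0/4
        return "multicast"
    # Anycast is indistinguishable from unicast by looking at the address.
    return "unicast (or anycast)"

for address in ("192.0.2.7", "192.0.2.255", "255.255.255.255", "224.0.0.1"):
    print(address, "->", classify(address, "192.0.2.0/24"))
```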

3 The Network Layer (IP Protocol)

3.1 Introduction

The main goal of the network layer is to deliver packets between end-nodes. In the case of the Internet Protocol (IP), the end-nodes are called hosts and are uniquely identified by a fixed-length address (32 bits in the case of IPv4 and 128 bits in the case of IPv6). As its name suggests, IP can connect different networks (inter-networking). In an IP network, routers (sometimes called gateways) interconnect these different networks and are in charge of routing packets from the source host to the destination host. As explained in Chapter 2, each router uses its forwarding table to decide to which of its neighboring routers it should forward each packet, until the packet reaches the router directly connected to the destination host. IP routers are also in charge of fragmenting IP packets whenever these are too large to be sent on a certain link. The IP layer in the destination host is in charge of re-assembling all the fragments of an IP packet, before delivering it to the transport layer.

The IP layer offers an unreliable service: packets can be lost, duplicated and delivered out of order. It does not offer any congestion or flow control service either. While it is possible to configure additional mechanisms on most routers to offer Quality of Service (QoS) guarantees, this is still fairly rare in the public Internet, where routers only offer a best effort service which comes with no QoS guarantees whatsoever.

While there are many similarities between IPv4 and IPv6, the two protocols do have several differences; in the interest of simplicity, the rest of the chapter will deal only with IPv4.

The IP Protocol

Figure 3.1 shows the IPv4 header, as defined in RFC 791 [27]. It has the following fields:

Version (Ver.) (4 bits): the version of the protocol, four in the case of IPv4.

IP Header Length (IHL) (4 bits): the length of the header in words of 32 bits.
This value corresponds to the offset of the IP payload from the start of the header. Note that padding is used to ensure that the header length is always a multiple of 32 bits. The smallest possible value for this field is 5 (i.e., 20 bytes).

Type of Service (8 bits): originally meant to distinguish between different QoS classes, with each class corresponding to different QoS parameters, like delay, throughput and reliability. RFC 2474 [24] and RFC 3168 [30] have the latest format of this field: the first six bits are the Differentiated Services Field, containing the code corresponding to the QoS class of the packet (RFC 2474); the last two bits are used for Explicit Congestion Notification (ECN) (RFC 3168), which can be used to signal to a TCP sender that it should reduce its rate. Routers supporting ECN can set one of the two bits in the case of congestion; this bit will be copied by the destination in the acknowledgments, so that the source will be aware of the congestion and reduce its sending rate. Both Differentiated Services and ECN are outside the scope of RES 302 and we will not delve any further into them; the interested reader is referred to RFC 2474, RFC 3168 and related RFCs. For a brief history of the ToS field see section 19 of RFC 2481 [29].

Figure 3.1: The IPv4 header. Each row is 32 bits wide:
Ver. | IHL | Type of Ser. | Total Length
Identification | 0 F L | Fragment Offset
Time To Live | Protocol | Header Checksum
Source Address
Destination Address
Options | Padding

Total Length (16 bits): the total length of the IP packet, including the header. The maximum value is 65535. This is the total length, therefore the maximum payload size is smaller and depends on how many options (if any) are used. RFC 791 states that all hosts must accept IP packets of at least 576 octets. Note that this is not the minimum packet size. It just implies that all IP hosts must be capable of receiving (and handling) packets with up to 576 octets.

Identification (16 bits): an identifying value for each packet. This value is chosen by the sender and it can be used to identify fragments belonging to the same packet (see below).

Flags (3 bits): contains three flags:
Bit 0: reserved, must be 0;
Bit 1 (F): whether the packet can be fragmented or not (0=May Fragment, 1=Do Not Fragment);
Bit 2 (L): whether this packet is the last fragment of a packet which has been fragmented before reaching the destination (0=Last Fragment, 1=More Fragments).

Fragment Offset (13 bits): the offset from the beginning of the first fragment of the current IP packet. The offset is measured in units of 8 octets (64 bits).

Time to Live (TTL) (8 bits): the sending host must initialize this field with a non-zero value. Every router forwarding the packet decrements its value by one. If it reaches zero, the packet must be discarded. This is to make sure that packets will eventually be destroyed even in the case of routing loops. Recall that it is possible to have routing loops in an IP network. Obviously this is undesirable, but it can be the result of a mistake in the configuration of a router. Loops can also temporarily appear when routers are propagating new routing information, for example following a topological change due to a link failure or a link recovery. Originally, this field was meant to represent the maximal lifetime of a packet in seconds, hence the Time To Live name. Given the difficulty of accurately measuring the lifetime in seconds of each packet, this field is now used to express the maximum number of hops that a packet can make (i.e., the maximum number of routers it can traverse). Typically each Operating System has a default value for the TTL of new packets, usually 64, which is the value recommended by IANA (footnote 1).

Protocol (8 bits): the code indicating the protocol of the payload. Typically this is a layer 4 protocol but not always: for example, it is possible to have an IP packet inside another one (this is usually used in IP tunnels). Initially these codes were specified in a separate RFC ([26]) which has been updated by many other RFCs. Since 2002 IANA maintains an online database with the latest version of all the codes at http: // (see RFC 3232 [34]). Some examples: 6 corresponds to TCP, 17 to UDP and 1 to ICMP (all these values are in decimal notation).

Header Checksum (16 bits): the checksum computed on the header only. See RFC 791 (p. 14) for the details of the algorithm.
Source Address (32 bits): the address of the source.

Destination Address (32 bits): the address of the destination.

Options (variable size): as the name suggests, this field is optional (its presence can be deduced from the header length). Options can be used for source routing and timestamping, among others. The complete and updated list of valid options is available on the IANA website:

Footnote 1: See

IP Fragmentation

Figure 3.2: An example of packet fragmentation in IPv4. The path goes from the source S through the routers R1 and R2 to the destination D; the MTU is 9000 bytes on the first link, 1500 bytes on the second and 512 bytes on the third.

Figure 3.2 shows an example of packet fragmentation. The Maximum Transmission Unit (MTU) is 9000 bytes on the link between the source and the first router. Therefore the source can send a packet (P1) with 1600 bytes in the payload (the example assumes that no IP options are used in the packet, so that the header is 20 bytes). The outgoing link of the first router has an MTU of 1500 bytes, forcing the router to split the packet in two. This operation is permitted as the F flag in the header is 0, indicating that routers can fragment the packet. The first packet (P2) contains the first 1480 bytes of the payload and the second packet (P3) the remaining 120. Both packets have the same ID (341) as the original packet, so that the receiver can re-assemble them. The second packet (P3) has the L flag set to 0, indicating that it is the last fragment. The offset field in P3 is set to 185, indicating that the offset of the first byte in the payload is 185 × 8 = 1480; this is because the offset is expressed as a multiple of 8 bytes.

The MTU for the link between the second router and the destination is only 512 bytes, forcing the second router to create even more fragments. It divides P2 into four packets: three (P4, P5, P6) with a payload of 488 bytes each and one (P7) with a payload of 16. The total length of each of the first three packets is 508 bytes, which is less than the MTU. This is because a packet with a total size equal to the MTU (512 bytes) would have a payload size of 492 bytes, which is not a multiple of 8. The largest multiple of 8 corresponding to an acceptable payload length is 488. (Recall that we are assuming that there are no IP options, so that the header size is 20 bytes.)

The destination has all the information needed to reassemble the fragments: all fragments have the same ID, and each one carries the offset, indicating the position of the fragment in the original packet.
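The numbers of this example can be reproduced with a short sketch. The function `fragment` and its parameters are hypothetical names chosen for illustration; the code only computes the offset, payload length and L flag of each fragment, assuming a 20-byte header without options, and does not build real packets.

```python
def fragment(payload_len, mtu, header_len=20, offset=0, more_follow=False):
    """Split a payload so that every fragment fits in the MTU.

    Returns a list of (offset in 8-byte units, payload length, L flag).
    offset and more_follow describe the packet being split, so that a
    fragment can itself be fragmented again by a later router.
    """
    # Largest payload that fits in the MTU and is a multiple of 8 bytes.
    max_payload = (mtu - header_len) // 8 * 8
    fragments, sent = [], 0
    while sent < payload_len:
        size = min(max_payload, payload_len - sent)
        is_last = sent + size == payload_len
        # L = 0 only on the very last fragment of the original packet.
        l_flag = 0 if (is_last and not more_follow) else 1
        fragments.append((offset + sent // 8, size, l_flag))
        sent += size
    return fragments

# First router: P1 (1600 bytes of payload) on a link with an MTU of 1500.
print(fragment(1600, 1500))
# [(0, 1480, 1), (185, 120, 0)]              -> P2 and P3

# Second router: P2 (offset 0, L = 1) on a link with an MTU of 512.
print(fragment(1480, 512, offset=0, more_follow=True))
# [(0, 488, 1), (61, 488, 1), (122, 488, 1), (183, 16, 1)]   -> P4 to P7

# P3 already fits in 512 bytes and is forwarded unchanged.
print(fragment(120, 512, offset=185))
# [(185, 120, 0)]
```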
Note that the destination combines the ID with the source address, the destination address and the protocol field in order to decide which fragments belong to the same packet. This prevents ambiguities even if different senders choose the same ID at the same time.
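To make the header layout of Figure 3.1 more concrete, here is a minimal sketch that packs a 20-byte IPv4 header (no options) and computes its checksum as the one's complement of the one's complement sum of the 16-bit words of the header, as specified in RFC 791. The function names, default values and addresses are assumptions chosen for illustration.

```python
import socket
import struct

def checksum(header: bytes) -> int:
    """One's complement of the one's complement sum of the 16-bit words."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:                       # fold the carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_header(src, dst, payload_len, ident=0, flags=0, frag_offset=0,
                 ttl=64, protocol=17):
    """Build a 20-byte IPv4 header; flags is the 3-bit value (0, F, L)."""
    ver_ihl = (4 << 4) | 5                   # version 4, IHL = 5 words
    total_length = 20 + payload_len
    flags_and_offset = (flags << 13) | frag_offset
    header = struct.pack("!BBHHHBBH4s4s", ver_ihl, 0, total_length, ident,
                         flags_and_offset, ttl, protocol, 0,  # checksum = 0
                         socket.inet_aton(src), socket.inet_aton(dst))
    # The checksum is computed with the checksum field set to zero,
    # then written into bytes 10 and 11 of the header.
    return header[:10] + struct.pack("!H", checksum(header)) + header[12:]

# P3 of the fragmentation example: ID 341, L = 0, offset 185, 120 bytes.
hdr = build_header("192.0.2.1", "198.51.100.7", 120, ident=341, frag_offset=185)
print(len(hdr), hdr[0] >> 4, hdr[0] & 0x0F)   # 20 4 5
print(checksum(hdr))                          # 0: the header verifies
```

A receiver can verify a header by running the same sum over all 20 bytes, checksum included: the result is zero when the header is intact.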

3.2.2 IPv4 addressing

As we have already seen in Chapter 2, IP was developed as a means to connect several existing networks into one larger inter-network. These existing networks were such that all the computers connected to the same network could already communicate directly with each other. Networks used different protocols and the role of IP was to be a network layer protocol shared by all these networks, so that any two computers could communicate with each other using IP, irrespective of the network they belonged to. These networks are sometimes called sub-networks to emphasize the fact that they are all connected in order to form a single, larger, network.

It is customary to think about these sub-networks as layer 2 networks, implying that these networks differ only as far as layers 1 and 2 are concerned. A typical example is an Ethernet network: by definition such a network implements only layers 1 and 2. Layers 3 and 4 are provided by IP and companion protocols (TCP and UDP at layer 4). While it is useful to think about this ideal case, and while we will only consider this case in RES 302, real networks can be much more complicated and might have multiple layer 3 protocols. For example, some of these sub-networks can use Asynchronous Transfer Mode (ATM), which is a layer 3 protocol (footnote 2). At the same time, it is always true that all the computers connected to a given sub-network can communicate directly using a certain protocol, which we can call the sub-network protocol, whatever its layer. Therefore IP can use the services offered by the sub-network protocol in order to send and receive data between any two nodes belonging to the sub-network. For example, in the case of an ATM sub-network, IP can use the ATM protocol to exchange packets directly between any two computers connected to the ATM sub-network, even though ATM happens to be a layer 3 (and 4) protocol.
Similarly, in the case of an Ethernet sub-network, IP can use the services provided by Ethernet to exchange packets directly between any two nodes on the same Ethernet sup-network. The commonly used expression to communicate directly refers to two nodes connected to the same sub-network that can communicate using the sub-network protocol without the intervention of an IP router. In other words there are no IP routers inside each sub-network. Hence the IP packets exchanged between two hosts on the same sub-network will not go through any IP router. This is not surprising, as the role of an IP router is exactly to connect different sub-networks. In this context, it is natural to assign a different IP prefix to each of these (sub)- networks: all the computers on the same network have IP addresses that share the same prefix. As explained in Chapter 2, this was first achieved by using three different prefix lengths with the corresponding address classes. Since 1994 prefixes can be of arbitrary length but the principle is always the same: all the IP hosts on the same sub-network share the same prefix and must be able to communicate directly at layer 2. To be precise the smallest and largest address in each prefix cannot be assigned to specific interfaces: the largest address (with all the bits in the host part set to one) is reserved as the broadcast address and it is meant to represent all the interfaces on a 2 To be precise ATM is a complex set of protocols offering layer 3 and 4 services, but this is outside the scope of RES

certain subnet. Similarly, the smallest address (with all the bits in the host part set to zero) is reserved to identify the subnet itself (footnote 3). Therefore, if a prefix of length $k$ has been assigned to a subnet, only $2^{n-k} - 2$ addresses can actually be used (where $n$ is the length of the addresses). This is why the longest possible prefix length in IPv4 is 30 and not 31: a prefix of length 31 has only two addresses and both of them are reserved; a prefix of length 30, instead, has four addresses and two of them can be used. In an IP network, even point-to-point links, which, by definition, connect only two interfaces, must use a prefix of length at most 30 (footnote 4).

[Figure 3.3: A simple IP inter-network.]

Figure 3.3 shows a simple inter-network, that is, a collection of different networks connected using IP routers. The IP protocol is the common language spoken by all the end-nodes in each of the sub-networks. This way any two nodes in the inter-network can exchange data using the IP protocol. Note that this is exactly what the Internet is; the only difference is that the number of sub-networks in the Internet is extremely large. For example, host A can communicate with host B as well as with host D by using the IP protocol, even though the former is on the same sub-network while the latter is not. The IP layer in each host will handle the communication correctly. As far as layer 4 is concerned, there is no difference between exchanging data between A and B or between A and D (footnote 5).

Footnote 3: As noted in RFC 3021 [33], the address with zeros in the host part is an obsolete form of broadcast address. See RFC 1812 [2] and RFC 922 [21] for more details about these addresses.

Footnote 4: Actually RFC 3021 [33] introduced prefixes of length 31, but this is outside the scope of RES 302.

Footnote 5: As you have seen in the socket programming exercises, you only need to specify the IP address (and the port number) corresponding to the other end of the socket, regardless of whether the destination is on the same sub-network or not.

Handling Inter and Intra Sub-Networks Communications

While the upper layers (Transport and above) are not aware of the location of the other communication end-point, the IP layer must handle the differences in the communication, depending on whether the destination is on the same IP subnet or not.

If the destination of an IP packet is in the same subnet as the sender, the two nodes can communicate directly at layer 2: the sending node can encapsulate the IP packet in a layer 2 frame whose destination address is the layer 2 address of the destination. Such a frame will reach the destination without going through any IP router. RFC 1122 [5] calls this a local delivery.

If the destination is not in the same IP subnet, the packet must go through one or more IP routers in order to reach the destination. In this case the source must send the packet to one of the IP routers connected to its subnet. Obviously any IP subnet must be connected to at least one router in order to be part of an inter-network. All IP nodes know the IP address of at least one router connected to the same subnet. This implies that the address of the node and the address of the router must have the same prefix, so that the node can communicate directly at layer 2 with the router. Therefore the source can encapsulate the IP packet inside a layer 2 frame whose destination address is the layer 2 address of the router. The router will then consult its routing table and forward the packet accordingly. RFC 1122 calls this case remote delivery.

If there are multiple routers connected to the same subnet, the source will not always contact the best placed one (i.e., the one closest to the destination). Whenever a host has a packet for a node that is not on the same subnet, it will first forward it to the default gateway. In some cases, the default gateway will then forward the packet to one of the other routers in the same subnet. Let R_d be the first (default) router receiving the packet and let R_o be the router best placed to reach the destination among all those connected to the same subnet. It would obviously be more efficient for the source to send packets directly to R_o. To this end, R_d can send an ICMP redirect message (see section 3.2.5) telling the source to send the packets directly to R_o. The source should then store this information in a local cache.

Consider, for example, the network shown in Figure 3.3: it is reasonable for D to use either R1 or R3 as its default gateway, rather than R4. Suppose that R3 is the default gateway for D. In this case, if D has a packet for F, it will forward it to R3, which will then forward it to R4. R3 will also send an ICMP redirect message to D, informing it that F is reachable via R4 (footnote 6).

Footnote 6: See RFC 1122 for a detailed description.

One problem with this solution is that ICMP redirect messages are not secure. They are not encrypted and the receiver cannot be sure of the identity of the sender: any host can send a redirect message, potentially intercepting the traffic of other nodes. This is why redirect messages are not often used.

So far we have discussed the difference between local and remote destination nodes but we have not yet seen how the source can distinguish between the two cases. This can

be easily accomplished by the following algorithm:

1. The source performs a bitwise AND between its own netmask and the destination IP address.
2. The source performs a bitwise AND between its own netmask and its own IP address.
3. If the results of these two operations are equal, the destination is in the same subnet; if they are not, it is in a different subnet.

Algorithm 1: The pseudocode presented in RFC 950 to distinguish between local and remote destination addresses.

    if bitwise_and(dg.ip_dest, my_ip_mask) = bitwise_and(my_ip_addr, my_ip_mask) then
        send_dg_locally(dg, dg.ip_dest)
    else
        send_dg_locally(dg, gateway_to(bitwise_and(dg.ip_dest, my_ip_mask)))
    end if

Algorithm 1 shows the pseudocode proposed by RFC 950 [22] (section 2.2) for sending IP packets. As IP packets are sometimes called IP datagrams, the dg in the pseudocode stands for datagram.

The last problem to solve is how the source can find the layer 2 address of the destination (in the case of local delivery) or the layer 2 address of the gateway it has decided to use (in the case of remote delivery). Note that in both cases the layer 2 address belongs to a node (end-host or router) connected to the same layer 2 network as the source. In the case of Ethernet networks, the source can use the Address Resolution Protocol (ARP), which is covered in the next section.

What we have described so far is what existing RFCs prescribe for end-nodes. That being said, most modern Operating Systems use a slightly different solution: each node has a routing table (just like IP routers) and it uses the longest prefix match algorithm to determine where to send outgoing packets. The details of this solution are outside the scope of RES 302.

The Address Resolution Protocol (ARP)

As explained in the previous section, nodes (either hosts or routers) can use the Address Resolution Protocol to find the MAC address corresponding to a certain IP address.
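As an aside before the details of ARP, the local-versus-remote test of Algorithm 1 can be sketched in a few lines of Python. This is only a minimal illustration, not the code of any real IP stack: the function name `next_hop` is hypothetical, the addresses are documentation examples that do not appear in Figure 3.3, and we assume a single default gateway instead of the `gateway_to` lookup of RFC 950.

```python
import ipaddress


def next_hop(my_addr: str, my_mask: str, dest: str, default_gateway: str) -> str:
    """Return the IP address whose layer 2 address the source must resolve.

    Hypothetical helper: a single default gateway is assumed.
    """
    mask = int(ipaddress.IPv4Address(my_mask))
    mine = int(ipaddress.IPv4Address(my_addr))
    target = int(ipaddress.IPv4Address(dest))
    # Steps 1 and 2: bitwise AND of the netmask with each address.
    if target & mask == mine & mask:
        return dest             # step 3, equal prefixes: local delivery
    return default_gateway      # step 3, different prefixes: remote delivery


# Example addresses are assumptions chosen for illustration only.
print(next_hop("192.0.2.10", "255.255.255.0", "192.0.2.77", "192.0.2.1"))
print(next_hop("192.0.2.10", "255.255.255.0", "198.51.100.5", "192.0.2.1"))
```

In both cases the returned address belongs to a node on the same layer 2 network as the source, which is exactly the address that must then be resolved to a MAC address.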
The basic idea behind the protocol is fairly simple: suppose that two nodes A and B are on the same IP subnet and that their MAC addresses are m_A and m_B. If A wants to send an IP packet to B, it needs to find B's MAC address, as discussed in the previous section. In this case A sends a broadcast (at layer 2) message asking who has B's IP address (ARP calls this the target IP address). This is an ARP request message, which contains

the target IP address as well as the IP and MAC addresses of the sending node (A's IP address and m_A in the example). Given that the ARP request is a broadcast message, all nodes on the layer 2 network receive it. Each node compares the requested IP address with its own, and the station which recognizes its own IP address will reply with an ARP reply message, which is a unicast message to the node which sent the ARP request. Obviously the ARP reply contains the MAC address of the destination, but it also contains its IP address, as well as the MAC and the IP address of the node that sent the ARP request.

In our example, B would receive the ARP request, recognize its IP address as the target address and send an ARP reply message with its MAC address (m_B). In this case B would also learn A's MAC address, as this information is included in the request, as well as A's IP address. This way B does not need to generate a new ARP request if it has a packet for A. As communications are often bidirectional, it is reasonable to assume that B will send IP packets to A.

[Figure 3.4: The Ethernet header and the ARP packet format for Ethernet networks.]

Figure 3.4 shows the ARP packet format for Ethernet networks, as specified by RFC 826 [25]:

- Hardware Type (2 bytes): a code indicating the layer 2 protocol used; the value for Ethernet is 1. A first list of the available codes appeared in RFC 1060 [35], but that RFC is now obsolete and the up-to-date list of the codes can be found on the IANA website.
- Protocol Type (2 bytes): a code indicating the layer 3 protocol used. ARP can be used with protocols other than IP, hence the presence of this field.
The codes used are the same as those used in Ethernet frames to indicate the layer 3 protocol. These

codes are often called EtherType and they are administered by the IEEE (footnote 8). The code for IPv4 is 2048 in decimal notation, which corresponds to 0x0800 in hexadecimal notation.
- Hardware Address Length (1 byte): the length, in bytes, of the hardware (i.e., layer 2) addresses. The value for Ethernet networks is 6, corresponding to 48 bits.
- Protocol Address Length (1 byte): the length, in bytes, of the network layer addresses. The value for IP networks is 4, corresponding to 32 bits.
- Operation Code (2 bytes): a code indicating the requested operation. RFC 826 defines only two values: 1 for requests and 2 for replies. Currently (2012), other operation codes have been defined; see the IANA website for details.
- Sender Hardware Address (length specified in the Hardware Address Length field): the hardware address (MAC address for Ethernet networks) of the sender of the packet.
- Sender Protocol Address (length specified in the Protocol Address Length field): the protocol address (IP address for IP networks) of the sender of the packet.
- Target Hardware Address (length specified in the Hardware Address Length field): the hardware address (MAC address for Ethernet networks) of the recipient of the packet.
- Target Protocol Address (length specified in the Protocol Address Length field): the protocol address (IP address for IP networks) of the recipient of the packet.

Note that the ARP packet cannot be encapsulated in an IP packet, as the MAC address of the destination is not yet known when an ARP request is sent. Therefore it is encapsulated directly in an Ethernet frame with the protocol type set to 2054 in decimal notation, corresponding to 0x0806 in hexadecimal notation.

If no machine in the layer 2 network is currently using the target IP address, or if the machine using that address is not turned on, no ARP reply message will be generated. Most Operating Systems will retransmit the ARP request a few times when they do not receive a response after a certain time.
If no responses are received, the Operating System will declare the target IP address as unreachable and generate the corresponding error condition.

Hosts can also use the ARP protocol to determine whether an IP address is already in use: if a host sends an ARP request for a certain IP address and does not receive any reply, it can conclude that, at least for the time being, that IP address is not in use. The DHCP standard (RFC 2131 [10]) suggests using ARP requests to verify that a certain IP address is not already in use, and RFC 5227 [6] further clarifies this mechanism and its parameters (timeouts, number of attempts, etc.).

It is worth mentioning that in IPv6 ARP has been replaced by a similar protocol, called Neighbor Discovery Protocol (NDP) (see RFC 4861 [23]).

Footnote 8: See for an up-to-date list.
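To make the packet format of Figure 3.4 more concrete, the following sketch packs an ARP request for Ethernet and IPv4 with Python's standard `struct` module. It is only an illustration under stated assumptions: the function name `build_arp_request` and the example addresses are hypothetical, and filling the Target Hardware Address with zeros in a request is a common convention rather than something required by the text above. Actually sending the packet would require encapsulating it in an Ethernet frame with type 0x0806, which is not shown here.

```python
import socket
import struct

# Network byte order: two 2-byte codes, two 1-byte lengths, a 2-byte
# operation code, then the four addresses (6 + 4 + 6 + 4 bytes).
ARP_FORMAT = "!HHBBH6s4s6s4s"


def build_arp_request(sender_mac: bytes, sender_ip: str, target_ip: str) -> bytes:
    """Pack an ARP request (hypothetical helper, Ethernet and IPv4 only)."""
    return struct.pack(
        ARP_FORMAT,
        1,                            # Hardware Type: Ethernet
        0x0800,                       # Protocol Type: IPv4
        6,                            # Hardware Address Length, in bytes
        4,                            # Protocol Address Length, in bytes
        1,                            # Operation Code: 1 = request
        sender_mac,                   # Sender Hardware Address
        socket.inet_aton(sender_ip),  # Sender Protocol Address
        b"\x00" * 6,                  # Target Hardware Address: not yet known
        socket.inet_aton(target_ip),  # Target Protocol Address
    )


# Example values are assumptions chosen for illustration only.
packet = build_arp_request(bytes.fromhex("020000000001"), "192.0.2.10", "192.0.2.77")
print(len(packet), "bytes")
```

With 6-byte hardware addresses and 4-byte protocol addresses the packet is 28 bytes long, which matches the sum of the field sizes listed above.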

ARP Cache

As hosts typically exchange more than a single IP packet, it is useful to store the information obtained thanks to ARP requests and responses in a local cache. RFC 826 suggests storing the relationships between IP and MAC addresses in a table. It also mentions that, as already noted above, the node replying to an ARP request knows the IP and MAC address of the sender and, therefore, it should store this information in its cache.

Whenever a cache is used to store information for later use, one has to address the problem of the validity of the cached information as time goes by. This problem is typically addressed by using timeouts and by removing the information once the timeout has expired. The value of this timeout and, more generally, the cache update mechanism are not specified in RFC 826 and are implementation specific.

It is worth mentioning that on several Operating Systems (including Unix-based systems), one can use the arp command to show the local ARP cache and to modify it.

Dynamic Host Configuration Protocol

Based on what was discussed in the previous sections, it should be clear that an IP host must be configured, at the very least, with an IP address and the associated netmask (i.e., prefix length) in order to be able to exchange IP packets with other IP hosts on the same layer 2 network. In order to communicate with IP hosts on other subnets, an IP host also needs to know the IP address of at least one router connected to the same subnet as the host. This router is often called the default gateway. Finally, in order to use the DNS, a host needs to know the IP address of at least one DNS server.

One possible solution is to configure all these parameters by hand on each node. Obviously this is a tedious and error-prone procedure, which is rarely used. A few protocols have been proposed to address this problem.
One of the first attempts was the BOOTP protocol, presented in RFC 951 [8]. This protocol is aimed mainly at diskless computers that need not only to configure the parameters of the IP stack but also to download an image of the Operating System from a server, hence the name BOOTP, which stands for bootstrap protocol.

The most widely used protocol for automatic configuration is the Dynamic Host Configuration Protocol (DHCP), which started as an extension to the BOOTP protocol (see RFC 1531 [9]). RFC 2131 [10] contains the current version of DHCP. DHCP can be used to set several parameters in each host; these include, but are not limited to, the following:

- IP address;
- netmask (i.e., prefix length);
- default gateway;
- IP address(es) of (a) DNS server(s);
- DNS name of the host;

- domain name for the host (e.g., telecom-bretagne.eu);
- default IP Time To Live;
- interface Maximum Transmission Unit (MTU);
- print server.

DHCP is based on a client-server model, where one (or more) server(s) contain the configuration information for the hosts in the network. It is possible to differentiate the configuration on a per-host basis. This can be accomplished, for instance, by using the MAC address to identify each host. (In this case the configuration information is associated with the MAC address.)

DHCP supports three different methods for allocating IP addresses. In each case the server is in charge of a range of IP addresses and it assigns these addresses to the clients following different policies:

- dynamic allocation: the server allocates an IP address to a client for a certain time. The term lease is often used in this case. Clients can request a renewal of the lease at the end of each lease period. In this case it is also possible to have more clients than IP addresses, as long as not all the clients are connected at the same time. (This can be the case for certain ISPs, especially those whose customers use dial-up modems.)
- automatic allocation: the server always allocates the same IP address to the same client. The first time a client requests an address, the server selects an available IP address and stores the MAC address of the client and the corresponding IP address. This way, when a previously known client requests an IP address again, it receives the same one.
- manual allocation: the network administrator statically assigns IP addresses to MAC addresses. This is usually done by editing a configuration file. One of the advantages of this solution is that only known hosts will be able to obtain an IP address from the DHCP server.
At the same time, note that this offers a very low level of security, as it is possible to change the MAC address of a host: a malicious user needs to know only the MAC address of one of the authorized hosts in order to force the DHCP server to reply.

When a host needs to configure its IP interface, it sends a broadcast message, called DHCP DISCOVER, specifying the parameters it needs. Upon receiving this message the server replies with a DHCP OFFER message, offering, as the name suggests, an IP address and the associated parameters requested by the host. It is possible for multiple servers to reply; in any case the client will choose only one of these offers and reply with a REQUEST message, addressed to the server it has chosen. This server will then reply with an acknowledgment message (ACK). At this point the client is configured and it can start using the network.

DHCP is a fairly complex protocol and this section is meant only as a short overview. The interested reader is referred to RFC 2131 [10] for more details.
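The four-message exchange and the dynamic allocation policy can be illustrated with a toy model. The sketch below is not an implementation of RFC 2131: the class `ToyDhcpServer`, its method names, the address pool and the lease duration are all hypothetical, messages are reduced to method calls, and lease expiry and renewal are left out.

```python
import ipaddress


class ToyDhcpServer:
    """Toy model of dynamic allocation (hypothetical, no real messages)."""

    def __init__(self, first_address: str, count: int, lease_seconds: int = 3600):
        start = ipaddress.IPv4Address(first_address)
        self.free = [str(start + i) for i in range(count)]  # unallocated pool
        self.offers = {}   # MAC address -> address offered but not yet confirmed
        self.leases = {}   # MAC address -> (address, lease duration in seconds)
        self.lease_seconds = lease_seconds

    def discover(self, mac: str):
        """Handle a DISCOVER: return the offered address, or None if the pool is empty."""
        if mac in self.leases:
            return self.leases[mac][0]
        if mac not in self.offers:
            if not self.free:
                return None
            self.offers[mac] = self.free.pop(0)
        return self.offers[mac]

    def request(self, mac: str, address: str) -> bool:
        """Handle a REQUEST: return True for an ACK, False for a refusal."""
        if self.offers.get(mac) == address:
            del self.offers[mac]
            self.leases[mac] = (address, self.lease_seconds)
            return True
        return mac in self.leases and self.leases[mac][0] == address


# Example values are assumptions chosen for illustration only.
server = ToyDhcpServer("192.0.2.100", count=2)
offer = server.discover("02:00:00:00:00:01")        # DISCOVER -> OFFER
print(offer, server.request("02:00:00:00:00:01", offer))  # REQUEST -> ACK
```

Keying the tables on the MAC address mirrors the per-host identification mentioned above; a pool smaller than the number of clients simply leads to `discover` returning `None` once all addresses are offered or leased.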

Internet Control Message Protocol (ICMP)

Even though, as previously mentioned, IP is not a reliable protocol, it is still useful for IP nodes to report certain error conditions, for example when an IP packet cannot reach its destination, either because the destination does not exist or because it is unreachable, or when a packet is discarded because its TTL has expired. IP nodes can use the Internet Control Message Protocol (ICMP) to inform each other of these errors. Routers can also use ICMP to inform hosts that a shorter (more direct) route to a certain destination is available through a different router.

As ICMP is a control protocol used by IP nodes to handle error conditions related to packet forwarding, it is traditionally regarded as a layer 3 protocol, even though ICMP messages are encapsulated inside IP packets, like layer 4 messages. The ping and traceroute programs, available on most Operating Systems, use the ICMP protocol as well: the first to determine whether IP packets can be successfully exchanged between two IP nodes, and the second to determine the IP addresses of the routers handling the packets sent between two IP nodes.

[Figure 3.5: ICMP message, with fields Type, Code, Checksum and Data (the format of the data field depends on the type of message).]

Figure 3.5 shows the format of a generic ICMP packet: the values of the type and code fields (1 byte each) distinguish the different ICMP messages. Table 3.1 shows some of them (footnote 9). The checksum is calculated using the same algorithm as in IP (see RFC 791 for the details). The format of the data field depends on the type of message. In the case of error messages, like host unreachable or TTL expired, the data field includes the IP header and the first 64 bits (8 bytes) of the payload of the IP packet that caused the error message to be generated.
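The generic format of Figure 3.5 and the checksum can be sketched as follows. This is a minimal illustration, assuming the usual one's complement sum over 16-bit words that the IP header checksum uses; the function names `internet_checksum` and `build_icmp_message` are hypothetical, and the sketch only builds the bytes of a message without sending anything.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """One's complement of the one's complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole number of words
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:                       # fold the carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_icmp_message(icmp_type: int, code: int, data: bytes = b"") -> bytes:
    """Pack Type (1 byte), Code (1 byte), Checksum (2 bytes) and Data."""
    # The checksum is computed with the checksum field set to zero.
    unchecked = struct.pack("!BBH", icmp_type, code, 0) + data
    checksum = internet_checksum(unchecked)
    return struct.pack("!BBH", icmp_type, code, checksum) + data


# Type 8, code 0 is an echo request; the data bytes are arbitrary here.
message = build_icmp_message(8, 0, b"\x00\x01\x00\x01hello")
print(message.hex())
```

A receiver can verify a message by running the same function over the whole message, checksum included: the result is zero when the message is intact.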
Footnote 9: See for a complete list.

The ping utility

The ping program uses ICMP echo request and echo reply messages to determine whether IP packets can be successfully exchanged between two IP nodes and, in the case of success, it computes the minimum, maximum and average Round Trip Time (RTT) and its standard deviation, as well as the packet loss ratio. It is often used to diagnose network problems; if a node does not respond to ping requests, it usually means that there is no IP connectivity between the two hosts. Using different

target nodes, it is sometimes possible to determine where a problem is. For example, if a host cannot ping its own default gateway, it implies that it will not be able to reach computers located on different subnets. Note that, on most operating systems, it is possible to disable ICMP echo replies. Obviously, if an IP node has been configured not to generate ICMP replies, it cannot be used as the target of a ping command, but it can still be used as the source.

    Type  Code  Description
    0     0     Echo reply (ping)
    3     0     Destination network unreachable
    3     1     Destination host unreachable
    3     2     Destination protocol unreachable
    3     3     Destination port unreachable
    3     4     Fragmentation required but DF flag set
    3     5     Source route failed
    3     6     Destination network unknown
    3     7     Destination host unknown
    5     0     Redirect Datagram for the Network
    5     1     Redirect Datagram for the Host
    8     0     Echo request (ping)
    11    0     TTL expired in transit
    11    1     Fragment reassembly time exceeded

Table 3.1: The Type and Code values for some of the ICMP messages.

The following is an example of the output generated by the ping utility:

    user@machinename:~$ ping -c 3 pc -df priv.enst-bretagne.fr
    PING pc -df priv.enst-bretagne.fr ( ): 56 data bytes
    64 bytes from : icmp_seq=0 ttl=56 time=5.115 ms
    64 bytes from : icmp_seq=1 ttl=56 time=4.999 ms
    64 bytes from : icmp_seq=2 ttl=56 time=4.988 ms
    --- pc -df priv.enst-bretagne.fr ping statistics ---
    3 packets transmitted, 3 packets received, 0.0% packet loss
    round-trip min/avg/max/stddev = 4.988/5.034/5.115/0.057 ms

The ping program sends ICMP echo request messages to the destination specified on the command line. If the user gives a name instead of an IP address, ping uses the DNS to find the corresponding IP address. Figure 3.6 shows the format of an echo reply (and request) ICMP message. The identifier and sequence number are used to match replies to requests.
The standard (RFC 792) does not specify the difference between the identifier and the sequence number, but it does suggest that the identifier can be used

like a TCP (or UDP) port to identify a session, while the sequence number can be used to distinguish requests within each session. The data field is optional and, if present, the receiver must copy its content from the request to the reply message.

[Figure 3.6: ICMP echo request and echo reply message format, with fields Type, Code, Checksum, Identifier, Sequence Number and Data (optional).]

The traceroute utility

The traceroute program exploits the fact that IP routers should generate an ICMP TTL expired message whenever they discard a packet because its TTL has reached 0. By sending packets with increasing TTL values (starting with 1), traceroute can determine the IP addresses of the routers between the sender and the receiver.

[Figure 3.7: How the traceroute program works between the nodes S and D. Packets sent by S with TTL=1, 2 and 3 trigger ICMP TTL expired error messages from R1, R2 and R3 respectively; the packet sent with TTL=4 reaches D, which replies with an ICMP port unreachable error.]

Figure 3.7 shows an example of how traceroute works: the source sends a UDP packet to the destination, using a destination port number greater than 33434, with a TTL value of 1. The first router (R1 in the figure) decrements the TTL by 1 and then generates an ICMP TTL expired error message because the TTL has reached 0. The IP source address of the ICMP message is one of the IP addresses of the router. This

way the source knows one of the IP addresses of the first router. Then the source sends another UDP packet to the destination, but this time with the TTL set to 2. This way the first router will forward the packet but the second will generate an ICMP TTL expired message. The process is repeated until the source receives an ICMP port unreachable error message, indicating that the destination did receive the packet.

Not all routers generate ICMP TTL expired messages. This is why, if traceroute does not receive an ICMP message within a certain time after sending a packet (usually 5 seconds), it prints an asterisk (*) to indicate the missing ICMP message and sends another packet with the same TTL value. If it still has not received any ICMP message after three attempts, it moves to the next TTL value. Figure 3.8 shows a sample output generated by the traceroute utility. Note that the third router is represented by asterisks, as traceroute did not receive any ICMP message for the packets it sent with TTL=3.

    :~$ traceroute pc -df priv.enst-bretagne.fr
    traceroute to pc -df priv.enst-bretagne.fr ( ), 64 hops max, 52 byte packets
    1 mgs - salsa - i ( ) ms ms ms
    2 asr - inside ( ) ms ms ms
    3 * * *
    4 te1-1 - stbrieuc -rtr noc. renater.fr ( ) ms ms ms
    5 te1-1 - lannion -rtr noc. renater.fr ( ) ms ms ms
    6 vl857 -pc1 - brest1 -rtr noc. renater.fr ( ) ms ms ms
    7 telecom - brest - l3vpn - vl857 -pc1 - brest1 -rtr noc. renater.fr ( ) ms ms ms
    8 c vlan36. enst-bretagne.fr ( ) ms ms ms
    9 pc -df priv.enst-bretagne.fr ( ) ms ms ms

Figure 3.8: Sample traceroute output.

Network Address Translation

As discussed in Section 2.3.2, in the mid 1990s the Internet switched to Classless Interdomain Routing (CIDR). One of the reasons for this change was the increasing demand for the allocation of IPv4 address classes. Recall that, before CIDR, the IPv4 address

space was divided into a fixed number of prefixes (blocks of consecutive IP addresses) and these prefixes had only three possible lengths: 8 bits (class A), 16 bits (class B) or 24 bits (class C). Class C prefixes allowed for subnetworks with at most 254 computers (footnote 10); such networks are fairly small and only a limited number of institutions and businesses were interested in them. Class B prefixes, instead, allowed for networks with up to 65534 computers, which is enough for most networks. Because of this, the demand for class B prefixes was very high, while the demand for class C prefixes was much lower. CIDR improved this situation by removing the three fixed prefix lengths and introducing a more flexible addressing scheme, which also made it possible to reduce the size of the routing tables by aggregating prefixes. (See Chapter 2 for more details.)

At the same time, due to the growing success of the Internet, the demand for IPv4 addresses kept growing, bringing closer the day when there would be no unallocated IPv4 prefixes. This motivated the search for solutions limiting the number of allocated IPv4 addresses. One such solution has been widely adopted and it is sometimes credited for the fact that the depletion of the IPv4 address space has been slower than expected (footnote 11).

The idea behind this solution is fairly simple: as quite a few networks, mostly with client computers, are connected to the Internet through a single router, it is possible to hide the whole network behind that router using a single (public) IP address. This is called Network Address Translation (NAT). Figure 3.9 shows a NAT example: the computers connected to the network on the left of the device X are part of the private network. This network is connected to the Internet only through this device, which is in charge of the address translation.

[Figure 3.9: A NAT example (X is the NAT device).]
Footnote 10: Note that we are taking into account the fact that the first and last IP address in a prefix cannot be used, hence the maximum number of computers on a network with a prefix length $k$ is $2^{n-k} - 2$, where $n$ is the address length.

Footnote 11: As of the end of 2012, there are still some unallocated IPv4 prefixes but these are allocated with increasingly strict rules, see, for example, the RIPE website internet-coordination/ipv4-exhaustion.

Three prefixes have been reserved for private networks (RFC 1918 [32]):

- 10.0.0.0/8 (formerly a single class A prefix),
- 172.16.0.0/12 (formerly 16 contiguous class B prefixes),

- 192.168.0.0/16 (formerly 256 contiguous class C prefixes).

These prefixes cannot be used on the public Internet. The private network in the figure uses a prefix of length 24 taken from these reserved ranges. It is obviously possible to use sub-prefixes of the reserved prefixes; the choice is up to the network administrator. As the network in the figure is small, a prefix of length 24 is more than enough.

As mentioned above, one of the ideas behind NAT is that most (often all) of the computers in the private network are client computers: as such, they will only contact other computers and they will not be contacted. In other words, all the connections are outgoing: even in the case of UDP exchanges, the first packet is sent from one computer in the private network to a computer on the Internet and not the other way around. Recall that one of the definitions of client is that a client uses the services offered by a server computer. This is usually accomplished by the client sending a request to the server or opening a connection with the server. The crucial point is that only a server computer must be able to accept incoming requests and/or connections. This important fact is exploited by the NAT.

Outgoing packets are straightforward to process: it is enough to overwrite the private IP address (used as the source address) with the public IP address. Incoming packets present one extra challenge: the NAT must decide which machine in the private network is the intended destination. As private IP addresses cannot be used in the Internet, all incoming packets have the public IP address of the NAT as their destination address. The NAT must then decide to which internal machine to forward each incoming packet.

To solve this problem, the NAT takes advantage of the fact that there will always be an outgoing packet first, as long as all the machines on the private network are clients. When the NAT forwards the first outgoing packet of a new exchange (TCP or UDP), it writes in a table the private source address and the source port.
It then selects a new (and unique) source port number (called public local port in the following) and writes this value in the table as well. It then overwrites the source port and the source IP address with the public local port and its public IP address. When the NAT forwards other outgoing packets belonging to the same connection, it uses the table to determine the value of the public local port and it overwrites the source port and the source IP address. Note that the NAT can also decide whether an outgoing packet is the first one of a connection: if there is no line in the table with the same private local port and private source IP address, the packet belongs to a new connection. Otherwise, if there is already a line in the table with the same values for the local source port and source IP address, the packet belongs to an existing connection and the table contains the value of the public local port. As the NAT overwrites port numbers and not only addresses, a more appropriate term is Network Address and Port Translation (NAPT), but this term is rarely used.

Whenever the NAT receives an incoming packet, it extracts the source IP address and the destination port. It then looks in the table for the line with the same public local port and overwrites the destination IP address and destination port with the private IP address and private local port in the table. It is important to remark that the NAT must also recompute, and overwrite, the checksum fields of the IP and TCP (or UDP) headers

for all the packets it modifies, given that these checksums cover the IP addresses (source and destination) and the ports (source and destination).

Table 3.2: The content of the NAT table (columns: local IP address, local private port, local public port).

Note that the NAT must choose a unique value for the public port, that is, each connection listed in the table must have a different public port. Otherwise it would not be possible to decide the internal destination of incoming packets. To be precise, it is enough for the three-tuple (public local port, remote IP address, remote port) to be unique, but in this case the NAT should also store the remote port number and the remote IP address. Another option is to guarantee the uniqueness of the pair (public local port, remote IP address); in this case it would be enough for the NAT to store the remote IP address.

Consider the example in Figure 3.9 and suppose that A wants to open a TCP connection with S1, for example because S1 is a web-server. In this case A sends a TCP SYN packet with its own private address as source address, the address of S1 as destination address, source port 8932 and destination port 80. Recall that the source port is chosen, among the ephemeral ports, by the operating system (see Section 1.3.2). The NAT looks in its table to see if there is already a line with the private address of A and private local port 8932. As this is the first packet of a connection, such a line does not exist and the NAT adds a new line with this information. It also selects a new public port, say 9758. Before transmitting the packet, the NAT overwrites the source IP address (the private address of A) with its own public IP address and the source port with the local public port (9758).
Assuming that everything is working correctly, the packet reaches S1, which is going to reply by sending a packet with its own address as source address, the public address of the NAT as destination address, source port 80 and destination port 9758. This packet is routed to the public interface of the NAT, which looks in the table for the line with public local port 9758. Such a line does exist and it contains the local IP address (that of A) and the private local port (8932). Therefore the NAT forwards the packet on the private network, overwriting the destination IP address with the private address of A and the destination port with 8932.

Using this mechanism, it is possible for two computers on the private network to establish a communication with the same destination port on a remote computer. For example, B can also open a TCP connection with port 80 of S1. In this case it will send a TCP SYN packet with its own private address as IP source address, the address of S1 as destination address, source port 4586 and destination port 80. The NAT realizes that this is the first packet of a new connection and selects a new (unique) public local port, for example 6857, and it modifies the packet accordingly. This packet is routed to S1, which replies with a packet with its own address as source address, the public address of the NAT as destination address, source port 80 and destination port 6857. When the NAT receives this packet, it checks in its table

and it finds that the local public port 6857 corresponds to the local private port 4586 on node B. It overwrites the destination address (with the private address of B) and the destination port (with 4586) and it forwards the packet to the destination. Table 3.2 shows the content of the NAT table for these two connections as well as a third one, for example between A and S2.

The mechanism described so far works only for outgoing connections. In the case of incoming connections, the NAT cannot use its table to decide to which of the internal computers it should forward the packet. Most NAT implementations do allow the administrator to specify to which internal computer the NAT should forward incoming packets addressed to a certain port. This solution allows having server computers on the private network, but there can be only one server for a given port. As NAT is usually deployed in networks where there are no servers, this is not a problem.

A more serious problem with NAT is that it cannot easily cope with application layer protocols that use IP addresses (and/or port numbers) in their messages. As TCP/IP uses an imperfect layering, TCP (and UDP) communication end-points use IP addresses. Several application layer protocols use IP addresses in their messages as well: notable examples are the File Transfer Protocol (FTP, RFC 959 [28]) and the Session Initiation Protocol (SIP, RFC 3261 [36]). NAT boxes can handle these protocols correctly if they inspect the payload of each packet to determine the application layer protocol and whether it contains IP addresses or port numbers that need to be modified. While this solution is viable, it does significantly increase the complexity of the NAT implementation. Furthermore, the deployment of new application layer protocols requires modifying existing NAT implementations whenever such protocols use IP addresses and port numbers.
This is not the case for normal IP routers, which do not need to be modified when new application layer protocols are introduced, even if they use IP addresses and port numbers in their messages. Another difference between routers and NAT boxes is that routers do not keep any state information: they simply route each packet independently of all the others. NAT boxes, on the contrary, must keep track of all the ongoing connections. If a router breaks down, other routers will detect the failure and, provided alternate paths exist, traffic will flow again as soon as the routing protocol has converged. If, instead, a NAT box breaks down, all the connections must be restarted.

These shortcomings notwithstanding, NAT is a widely used solution, especially in home networks. Typically a single box implements all the functions of an ADSL (or cable) modem, a NAT/router, a DHCP server and an Ethernet switch.
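The translation table described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `Nat`, the addresses (taken from documentation and private ranges) and the starting public port 6000 are hypothetical, only the private ports 8932 and 4586 come from the text, and a real NAT also tracks the protocol, the remote end-point and timeouts.

```python
class Nat:
    """Toy model of the NAT table: one line per outgoing connection."""

    def __init__(self, public_ip, first_port=6000):
        self.public_ip = public_ip      # address of the public interface
        self.next_port = first_port     # next unused public local port (assumed policy)
        self.out = {}                   # (private ip, private port) -> public port
        self.back = {}                  # public port -> (private ip, private port)

    def outgoing(self, src_ip, src_port, dst_ip, dst_port):
        """Rewrite the source of a packet leaving the private network."""
        key = (src_ip, src_port)
        if key not in self.out:         # first packet of a new connection
            self.out[key] = self.next_port
            self.back[self.next_port] = key
            self.next_port += 1
        return (self.public_ip, self.out[key], dst_ip, dst_port)

    def incoming(self, src_ip, src_port, dst_ip, dst_port):
        """Rewrite the destination of a reply, or return None if no line matches."""
        entry = self.back.get(dst_port)
        if entry is None:
            return None                 # unsolicited incoming packet
        return (src_ip, src_port, entry[0], entry[1])


if __name__ == "__main__":
    nat = Nat("203.0.113.1")                                   # hypothetical public address
    print(nat.outgoing("10.0.0.1", 8932, "198.51.100.1", 80))  # host A to S1
    print(nat.outgoing("10.0.0.2", 4586, "198.51.100.1", 80))  # host B to S1
    print(nat.incoming("198.51.100.1", 80, "203.0.113.1", 6001))
```

The `incoming` method returning `None` mirrors the limitation discussed in the text: without a line in the table (or a static rule set by the administrator) the NAT does not know where to forward an incoming packet.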

Bibliography
[1] J. Abley and K. Lindqvist. Operation of Anycast Services. RFC 4786 (Best Current Practice). Dec.
[2] F. Baker. Requirements for IP Version 4 Routers. RFC 1812 (Proposed Standard). June.
[3] T. Berners-Lee, R. Fielding, and H. Frystyk. Hypertext Transfer Protocol HTTP/1.0. RFC 1945 (Informational). May.
[4] Olivier Bonaventure. Computer Networking: Principles, Protocols and Practice.
[5] R. Braden. Requirements for Internet Hosts - Communication Layers. RFC 1122 (Standard). Oct.
[6] S. Cheshire. IPv4 Address Conflict Detection. RFC 5227 (Proposed Standard). July.
[7] M. Cotton et al. Internet Assigned Numbers Authority (IANA) Procedures for the Management of the Service Name and Transport Protocol Port Number Registry. RFC 6335 (Best Current Practice). Aug.
[8] W.J. Croft and J. Gilmore. Bootstrap Protocol. RFC 951 (Draft Standard). Sept.
[9] R. Droms. Dynamic Host Configuration Protocol. RFC 1531 (Proposed Standard). Oct.
[10] R. Droms. Dynamic Host Configuration Protocol. RFC 2131 (Draft Standard). Mar.
[11] R. Fielding et al. Hypertext Transfer Protocol HTTP/1.1. RFC 2616 (Draft Standard). June.
[12] V. Fuller et al. Classless Inter-Domain Routing (CIDR): an Address Assignment and Aggregation Strategy. RFC 1519 (Proposed Standard). Sept.
[13] Wolfgang John, Maurizio Dusi, and K. C. Claffy. Estimating routing symmetry on single links by passive flow measurements. In: Proceedings of the 6th International Wireless Communications and Mobile Computing Conference. IWCMC '10. ACM, 2010.
[14] Wolfgang John and Sven Tafvelin. Analysis of internet backbone traffic and header anomalies observed. In: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement. IMC '07. ACM, 2007.
[15] J. Klensin. Internationalized Domain Names for Applications (IDNA): Definitions and Document Framework. RFC 5890 (Proposed Standard). Aug.
[16] James F. Kurose and Keith W. Ross. Computer Networking: A Top-Down Approach. 5th. Addison-Wesley Publishing Company.
[17] K. McCloghrie, R. Fox, and E. Decker. IEEE Token Ring MIB. RFC 1231 (Proposed Standard). May.
[18] S. McCreary and k. claffy. Trends in wide area IP traffic patterns - A view from Ames Internet Exchange. In: ITC Specialist Seminar.
[19] P.V. Mockapetris. Domain names - concepts and facilities. RFC 1034 (Standard). Nov.
[20] P.V. Mockapetris. Domain names - implementation and specification. RFC 1035 (Standard). Nov.
[21] J.C. Mogul. Broadcasting Internet datagrams in the presence of subnets. RFC 922 (Standard). Oct.
[22] J.C. Mogul and J. Postel. Internet Standard Subnetting Procedure. RFC 950 (Standard). Aug.
[23] T. Narten et al. Neighbor Discovery for IP version 6 (IPv6). RFC 4861 (Draft Standard). Sept.
[24] K. Nichols et al. Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers. RFC 2474 (Proposed Standard). Dec.
[25] D. Plummer. Ethernet Address Resolution Protocol: Or Converting Network Protocol Addresses to 48.bit Ethernet Address for Transmission on Ethernet Hardware. RFC 826 (Standard). Nov.
[26] J. Postel. Assigned numbers. RFC 790 (Historic). Sept.
[27] J. Postel. Internet Protocol. RFC 791 (Standard). Sept.
[28] J. Postel and J. Reynolds. File Transfer Protocol. RFC 959 (Standard). Oct.
[29] K. Ramakrishnan and S. Floyd. A Proposal to add Explicit Congestion Notification (ECN) to IP. RFC 2481 (Experimental). Jan.
[30] K. Ramakrishnan, S. Floyd, and D. Black. The Addition of Explicit Congestion Notification (ECN) to IP. RFC 3168 (Proposed Standard). Sept.
[31] Y. Rekhter and T. Li. An Architecture for IP Address Allocation with CIDR. RFC 1518 (Historic). Sept.
[32] Y. Rekhter et al. Address Allocation for Private Internets. RFC 1918 (Best Current Practice). Feb.
[33] A. Retana et al. Using 31-Bit Prefixes on IPv4 Point-to-Point Links. RFC 3021 (Proposed Standard). Dec.
[34] J. Reynolds. Assigned Numbers: RFC 1700 is Replaced by an On-line Database. RFC 3232 (Informational). Jan.
[35] J.K. Reynolds and J. Postel. Assigned numbers. RFC 1060 (Historic). Mar.
[36] J. Rosenberg et al. SIP: Session Initiation Protocol. RFC 3261 (Proposed Standard). June.
[37] Abraham Silberschatz, Peter Baer Galvin, and Greg Gagne. Operating System Concepts. 8th. Wiley Publishing.
[38] Andrew Tanenbaum. Computer Networks. 4th. Prentice Hall Professional Technical Reference.
[39] Andrew S. Tanenbaum. Modern Operating Systems. 3rd. Prentice Hall Press.
[40] Laurent Toutain. Réseaux locaux et Internet. 3ème édition. Hermès.

Introduction to IP Networks
RES 302
Alberto Blanc, Jean-Pierre Le Narzul, Nicolas Montavont
RSM Department, Fall 2012

Course Presentation
Main goal: an introduction to IP networks
- top-down approach (starting from the Application layer)
- lectures will concentrate mainly on the Application, Transport and Network layers
- you will have to design and implement a simple session protocol
New format this year:
- one fewer module (this year: 301, 302, 303; until last year: 301, 302, 303, 304)
- the new 302 roughly corresponds to the old 302 and 303
- a large part of the course is dedicated to the project
- show the importance of standards (interoperability)
- changes motivated (among other factors) by the comments of the students

Course Overview
- overview of the TCP/IP protocol suite
- socket programming (Python)
- the IP protocol
- addressing
- routing
- Network Address Translation
- Application Layer Protocols (SIP)

The project (I)
Context: "Chat While You Watch" application
- chat with other users watching the same video (movie)
- streaming video server
- users choose a video to watch
- one chatroom for each video (users join the chatroom automatically)
- client and server need to use a session protocol: basically a signaling protocol (join, leave, get movie list, join movie, etc.)
- you will have to specify and implement your own
- we have already developed the application
- you have to implement only the protocol-specific parts

The project (II)
Goal: write the protocol specification and implement it
- hands-on approach, similar to what is done at the Internet Engineering Task Force
- groups of 4 people
- it is up to the students to form the groups (do it as soon as possible)
- if you have not found a group by Sep 10, write to [email protected] so that we can put those without a group in touch with each other
- groups must be formed by Sep 12 (send an email to [email protected] with the names)

Project Goals and Motivations
Goals:
- design and implement a simple communication protocol
- understand the role (and importance) of standards
- introduction to network programming
Motivations:
- networks exist to connect end-nodes (which are more numerous than intermediate systems)
- end-nodes are extremely important: there are more and more communicating devices (objects), and understanding their interactions with the network is useful
- layered model: the same principles apply to lower level protocols (each protocol uses the services offered by the lower layers)
- network programming is far from simple: it can be a useful and profitable skill

Course Calendar
Date               Session       Topic
Sep 7 morning      C-1, C-2      Introduction & Python programming
Sep 7 afternoon    TP-1          Python socket programming
Sep 10 morning     C-3, C-4      IP protocol
Sep 10 afternoon   TP-2          Twisted framework programming
Sep 12 afternoon   PC-1, PC-2    IP addressing PC and protocol specification
Sep 17 afternoon   PC-3, PC-4    IP addressing PC and protocol specification
Sep 21 afternoon   PC-5, PC-6    specification presentation, discussion and selection
Oct 1 afternoon    TP-3          implementation (1/3)
Oct 8 afternoon    TP-4          implementation (2/3)
Oct 15 morning     C-5, C-6      TCP, Routing
Oct 15 afternoon   TP-5          implementation (3/3)
Oct 19 morning     C-7, C-8      application-layer protocols
Oct 19 afternoon   PC-7, PC-8    application-layer protocols
Oct 26 afternoon   PC-9, PC-10   packet analysis
Nov 5 afternoon    TP-6          project evaluation

CWW

Students Contributions (I): Protocol Specification (I)
Each group has to write a different protocol specification
- do not even glance at the specifications of another group
- any form of copying (even minor) will be severely sanctioned
- the functional requirement document is available on Moodle
- your protocol must satisfy all the requirements:
  - what messages are exchanged (e.g., login request, user list request, etc.)
  - the format of each message
- each group has to upload the specs on Moodle by September 20 (9 am, CET)
- each group will present their proposal (September 21)
- you are encouraged to use xml2rfc, an IETF tool (see sample file on Moodle)
- do compile very often (to detect errors as soon as possible)

Students Contributions (I): Protocol Specification (II)
Why do we need standards (i.e., protocol specifications)?
- different and independent groups can write interoperable implementations
- in real life this means that different vendors can offer compatible products (e.g., routers, web browsers/servers, clients/servers, etc.)
- goal of your protocol specification document: two other groups should be able, by only reading your document, to implement a client and a server which work correctly together
Do Not Panic! You are not expected to write a perfect protocol specification (you will be graded mainly on your effort). Just make sure your specification contains all the messages needed and their format.

Students Contributions (II)
- during the presentations (September 21) we will:
  - either select one proposed protocol specification
  - or (most likely) merge/modify some of the proposed ones to obtain the final protocol specification (for the implementation)
- all the groups will implement the same specification (this makes interoperability tests possible)
- implementing a protocol specification is essential to prove that the specification is complete, and it reduces the risk of major errors (the IETF requires at least two independent implementations of a proposed protocol before it can become a standard)
- your implementation can cover only a subset of the requirements
- you will be responsible only for the protocol implementation (re-use existing code)

Deadlines and Grading
Date               Time         Deliverable
September 20       9 am, CET    upload protocol specification on Moodle
September 21       afternoon    specification presentation and selection
Nov 5              afternoon    final project evaluation and interoperability tests
TP: 3, 4, 5, 6                  submit code for testing
Grading: final grade for the UV: G = 0.5 Cs + [...] (0.7 Cc[...] + [...] Cc[...] + [...] Cc303), where:
- Cs = end-of-semester exam (contrôle semestriel)
- Cci = continuous assessment (contrôle continu) for module i

Code Testing
- why?
  - to make sure each group is working throughout the project
  - to force you to test your code often (there is no point in writing 100 (or more) lines of code without trying to execute them!)
- what? tests covering specific goals (e.g., sending a well-formed packet)
- when? each group MUST commit their changes to svn by 23:59 of the day before each lab programming session ("TP")
- pass/no-pass grades, used mainly for progress/effort monitoring
SVN: you are expected to read a tutorial about SVN, in particular the following three sections: Creating a Working Copy, Basic Work Cycle, Examining History (just click on the Next link in the top right-hand corner) and, optionally, the "Sometimes You Just Need to Clean Up" section.

PCs/TP groups
- for one of the groups the teacher (Alberto Blanc) will speak in English (other group(s) in French)
- students can ask questions either in English or in French in all groups
- it is up to each student to choose their group: please do this during the break
- hopefully the split will be somewhat even
- you can submit protocol specifications either in English or in French

Introduction to TCP
- offers a byte stream service
  - bytes will be delivered in order and reliably
  - no guarantees about message boundaries
- reliable service
The rest will be on the board.

Introduction to Python
- interpreted language (i.e., no compilation/linking needed)
- dynamic typing (no variable declaration)
  - type checking done only at run-time
  - errors are often not detected until run-time
- high level language
- many existing libraries (the standard one is fairly large)
- object-oriented (optionally)
- useful for scripting (much easier than shell or Perl!)
- a good tutorial is available online
Homework: you are expected to read the following chapters of the tutorial by the end of next week: 1, 2, 3, 4, 5 and 9. You are strongly encouraged to read the other chapters as well.

Built-in Types and Data Structures
- Python has the usual built-in types: integer, real, (complex), string, etc.
- Several built-in data structures as well:
  - lists (ordered mutable collections, l = [1,2,3])
  - tuples (ordered immutable collections, t = (1,2,3))
  - and a few more

Caveats
- indentation is important: blocks of code (e.g., a loop) are identified by the indentation
  - use 4 spaces to indent code
  - do not use tabs! (make sure that your editor inserts spaces when you press the tab key)
  - mixing tabs and spaces is a recipe for disaster (cryptic error messages)
- parameter passing: functions always receive references to objects, so
  - immutable objects (e.g., numbers, strings, tuples) behave as if passed by value
  - mutable objects (e.g., lists, dictionaries) behave as if passed by reference: changes made by the function are visible to the caller
- different versions of Python:
  - we will be using 2.7
  - version 3.2 is already available but not yet universally adopted
  - (somewhat) significant differences between 2.x and 3.x

A First Python Script

    #!/usr/bin/env python2.7
    import sys

    print "This is a simple program which adds two numbers "
    message = u"Type two numbers (put a space between them and press enter) \n"
    while True:
        # read one line from the keyboard and convert it to a unicode string
        inputstring = raw_input(message)
        inputstring = inputstring.decode(sys.stdin.encoding)
        # split on whitespace and convert the first two fields to integers
        fields = inputstring.split()
        a = int(fields[0])
        b = int(fields[1])
        print "the answer is: ", a + b

Comments
- the first line (#!/usr/bin/env python2.7) tells the shell to use the python2.7 command to execute the script (the same as typing python2.7 add.py)
- the import statement loads the module sys (like #include in C)
- strings can use single (') or double (") quotation marks (the u"text" syntax defines unicode strings)
- int(string) converts a string to an integer (it can raise an exception)

Operating System Processes
- from Wikipedia: "a process is an instance of a computer program that is being executed"
- each time you execute a Python script the OS creates a new process
- each process has its own context, including variables
- multiple processes can be instances of the same program (e.g., add.py)
- new processes are created with the fork system call:
  - it clones a process (the two processes are exactly the same, except for the return value of the fork system call)
  - it is up to the programmer to handle the fork correctly
  - after the fork the two processes do not share memory (i.e., variables)
  - after the fork the two processes do share file descriptors (i.e., files and sockets)
- processes can talk to each other (this is beyond the scope of RES 302)
- a process is not the same thing as a thread (this is beyond the scope of RES 302)

A Python Fork Example

    #!/usr/bin/env python2.7
    import sys, os

    l = [1, 2, 3]
    f = open('out.txt', 'w', 0)      # unbuffered output file, shared after the fork
    childpid = os.fork()             # returns 0 in the child, the child PID in the parent
    if childpid == 0:
        print "This is the child"
        f.write('this is the child writing\n')
        l.append(4)                  # modifies only the child's copy of l
        print "l in the child: ", l
    else:
        print "This is the parent, child PID: ", childpid
        f.write('this is the parent writing\n')
        l.append(25)                 # modifies only the parent's copy of l
        print "l in the parent: ", l

Output
This is the shell output:

    This is the parent, child PID:
    l in the parent: [1, 2, 3, 25]
    This is the child
    l in the child: [1, 2, 3, 4]

This is the output file:

    this is the parent writing
    this is the child writing

Socket Programming (I)
- a TCP connection corresponds to a socket of type SOCK_STREAM
- a UDP communication corresponds to a socket of type SOCK_DGRAM
- a socket (descriptor) behaves like a file (descriptor): normal read and write functions can be used
- special functions to create sockets (UDP and TCP)
- special functions to handle connection establishment (needed for TCP sockets)

The socket Module
- socket class (same name as the module)
- several functions and constants as well, for example:
  - socket.error
  - socket.SOCK_STREAM
  - socket.SOCK_DGRAM
  - socket.gethostbyname(hostname)
- the documentation of the socket class is available online

The socket constructor
socket.socket([family[, type[, proto]]])
- builds an object of type socket
- all the other functions (see below) need a socket object
- parameters (all of them optional):
  - family: possible values are defined as constants in the socket module: AF_INET (by default), AF_INET6, AF_UNIX
  - type: possible values are defined as constants in the socket module: SOCK_STREAM (by default), SOCK_DGRAM
  - proto: generally not used, as the protocol is implied by the two previous parameters
- example: s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

The bind function
socket.bind(address)
- allows a server process to specify the port number it wants to listen on
- it is also possible to specify an IP address (but one has to be sure that the specified IP address exists on the machine running the program)
- usually the IP address is not specified
- the address parameter is a tuple with 2 elements: (address, port)
- typically used only on the server side
- example:

    HOST = ''    # allows binding with any of the IP addresses of the machine
    PORT = ...
    s.bind((HOST, PORT))

The listen function
socket.listen(backlog)
- used to make the socket a server socket (TCP goes into the LISTEN state)
- after this call the socket can only be used to handle incoming connections (not to actually exchange data)
- non-blocking call
- the backlog parameter specifies the number of pending requests which can be stored (just use 10)
- it is not the maximum number of concurrent connections handled by the server
- example: s.listen(10)

The accept function
socket.accept()
- accepts incoming connections (used on the server)
- blocking call
- returns a tuple with 2 elements: a new socket object and the address of the client (the other end-point)
- example: socket_client, addr_client = s.accept()

The connect function
socket.connect(address)
- establishes a connection with the server specified in the address parameter
- the address parameter is a tuple with 2 elements: the IP address (or host name) and the port number of the remote node
- example:

    HOST = 'pc-df-302'
    PORT = ...
    s.connect((HOST, PORT))

Socket Programming (II): pseudo-code
Server                        Client
s = socket.socket(...)        s = socket.socket(...)
s.bind(...)
s.listen(...)
ns = s.accept(...)            s.connect(...)
ns.receive(...)               s.send(...)
ns.send(...)                  s.receive(...)
Important: this slide and the following one only show pseudo-code, not actual Python code. It is easy to write the correct instructions by reading the Python socket module documentation.

Socket Programming (III)
Parent Process                Child Process
s = socket.socket(...)
s.bind(...)
s.listen(...)
ns = s.accept(...)
os.fork(...)
ns.close(...)                 s.close(...)
                              ns.receive(...)
                              ns.send(...)
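The pseudo-code above can be turned into a small runnable sketch. It is written for Python 3 (the course itself uses 2.7), it runs server and client in the same process with a thread instead of fork, and it makes several assumptions that are not in the slides: the function names `serve_once` and `echo_roundtrip`, the loopback address, port 0 (which asks the OS for any free port) and the "upper-case echo" behaviour are purely illustrative.

```python
import socket
import threading


def serve_once(server_sock):
    """Accept one client, send back its data in upper case, then close."""
    conn, _addr = server_sock.accept()   # blocking call, returns a new socket
    with conn:
        data = conn.recv(1024)           # read at most 1024 bytes
        conn.sendall(data.upper())


def echo_roundtrip(payload):
    """Start a one-shot server on the loopback interface and query it."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
    server.listen(10)
    port = server.getsockname()[1]       # the port actually assigned
    worker = threading.Thread(target=serve_once, args=(server,))
    worker.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(("127.0.0.1", port))
    client.sendall(payload)
    reply = client.recv(1024)
    client.close()

    worker.join()
    server.close()
    return reply


if __name__ == "__main__":
    print(echo_roundtrip(b"hello"))
```

Note that the real method names are `recv` and `send`/`sendall`, not `receive`. Since TCP is a byte stream, a single `recv` is only adequate here because the message is tiny; a real protocol has to delimit its own messages.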

RES Major
RES 302: IP Networks
RES Major - UV 1
Course notes
Jean-Marie BONNIN, Head of the RSM department, [email protected]
David ROS, RSM department, [email protected]

Course outline
- C1 - C8:
  - Internet model vs. OSI, standardization
  - IP and associated protocols
  - The TCP, UDP and SCTP protocols
  - Addressing and address translation
  - Principles of routing protocols, the RIP protocol
  - Introduction to sockets and DNS
- PC1-2: dissection of TCP/IP frames
- PC3-4: IP addressing plan
- TP1: socket programming

Course outline
- C1 - C8:
  - Internet model vs. OSI, standardization
  - The IP, ARP and ICMP protocols
  - The TCP, UDP and SCTP protocols
  - Addressing and address translation
  - Principles of routing protocols, the RIP protocol
  - Introduction to sockets and DNS
- PC1-2: dissection of TCP/IP frames
- PC3-4: IP addressing plan
- TP1: socket programming

ISO and Internet models
- ISO (OSI)
  - Reference model: very popular
    - Value: a conceptual (and teaching) tool
    - Service primitives, SAP...
  - Protocols: in practice = ??
- Internet (TCP/IP)
  - Reference model: lightweight
    - A more pragmatic and informal approach
  - Protocols: extremely popular

ISO and Internet models: some differences
- Layered protocol architecture
  - OSI: a sacrosanct principle?
  - Internet: a principle to be respected, but...
    - "Permeability" between layers. Example: IP over Ethernet (the Ethernet padding bytes are passed up to IP)
    - Where should certain protocols be classified?
    - "Is Layering Harmful?" [Crowcroft et al., 1992]
    - Cross-layer optimization

ISO and Internet models: some differences (continued)
- Protocol design
  - OSI: first the model, then the protocols
    - Example of a resulting problem: layer 2 was designed for point-to-point links
  - Internet: first the protocols, then a model to describe what already existed
- Network layer (3)
  - OSI: connection-oriented or connectionless (datagrams)
  - Internet: datagrams only
    - Philosophy: offer a connection-oriented service at layer 4 (TCP)

ISO and Internet models
Protocol layer architectures: a (possible) approximate correspondence
ISO (OSI)          Internet (TCP/IP)   Role
7 Application      Application
6 Presentation     Application
5 Session          Application
4 Transport        Transport           end-to-end communication
3 Network          Network             forwarding (routing) between network nodes
2 Data link        Data link           access to the physical network
1 Physical         Physical

The Internet model: protocol architecture
Typical implementation of the protocol layers in a Unix system:
- user-level processes: Application (handles the application details)
- kernel-level processes (operating system kernel): Transport, Network, Data link (handle the communication details)

Internet protocol stack (a small subset)
Layer                  Protocols
Application            application, application, application, application
Transport              UDP, TCP
Network                ARP, IP, ICMP
Data link / Physical   Ethernet, IEEE (+ LLC + SNAP), PPP
- It is not easy to draw a sharp boundary between layers
- It is not easy to place certain protocols (e.g. ICMP, ARP)

Examples of application-level protocols
Layer                  Protocols
Application            routing protocols (OSPF, RIP), http, ping
Transport              UDP, TCP
Network                ARP, IP, ICMP
Data link / Physical   Ethernet, IEEE (+ LLC + SNAP), PPP

IP networks: data encapsulation
Example: TCP/IP on an Ethernet local area network
- Application: application data = data block
- TCP: TCP header + data block = TCP segment
- IP: IP header + TCP header + data block = IP packet
- Ethernet driver: Ethernet header + IP header + TCP header + data block + Ethernet CRC = Ethernet frame
- Physical

IP networks: data demultiplexing
- Problem: how are the data passed up along the protocol stack?
- Solution: the headers contain information that allows "switching" the data to the right protocol
Example: TCP/IP on an Ethernet local area network
- Ethernet header, Type field: 0x800 = IP, 0x806 = ARP
- IP header, Protocol field: 1 = ICMP, 6 = TCP, 17 = UDP
- TCP or UDP header: source and destination port number fields (+ source IP address), e.g. port 80 = http

IP networks: interconnection of heterogeneous networks
Figure: an end system running a web server and an end system running a web client, connected through a router (node).
- the http protocol (application level) and the TCP protocol (end to end) run between the two end systems
- the IP protocol runs between each end system and the router
- one end system is attached to the router through an Ethernet LAN (Ethernet protocol), the other through a serial link (PPP protocol)

Internet and IP
- Internet = a set of networks (Autonomous Systems, or AS) connected to each other
  - Internetworking = interconnection of networks
  - AS = administrative domain
- IP = Internet Protocol
  - Two versions, incompatible with each other
    - IPv4: the most common version today
    - IPv6: "next generation" IP

What does IP offer to the upper layers?
- A datagram service
  - Delivery of packets from the source to the destination(s)
  - Each packet contains the complete address of the destination
  - Connectionless service
    - Packets are forwarded (= routed) independently of each other
    - two consecutive packets may follow different routes
    - in practice this is fairly rare and one tries to avoid it
[Kurose & Ross 2003]

What does IP offer to the upper layers?
- A datagram service (continued)
  - Unreliable service
    - Packet loss: possible
    - Packet duplication: possible
    - In-order arrival of packets: not guaranteed
  - Also called a best-effort service

Interconnection of IP networks
Figure: two autonomous systems, AS 1 and AS 2, with host A in one and host B in the other.
- Addresses with global significance
  - (In principle) there cannot be duplicate addresses: otherwise the identification of the destination would be ambiguous
- Relaying of IP packets
  - Routers examine the addresses contained in the packets
  - Routing table: which is the next router along the path to reach B?

Internet: some architectural principles
Excerpt from RFC 1958, as summarized by [Tanenbaum 2002]:
- Make sure it works
  - Test before finalizing the standard
- Always prefer the simplest solution
  - Apply Occam's razor
- If there is more than one way of doing something, choose one
  - Avoid a multiplicity of options and parameters
- Exploit modularity
  - Protocol stack with independence between layers
- Be prepared for diversity
  - Flexibility to cope with very heterogeneous equipment

Internet: some architectural principles
Excerpt from RFC 1958, as summarized by [Tanenbaum 2002] (continued):
- Avoid "hard-coded" options and parameters
  - Prefer negotiating these values / options
- Look for a good design, not a perfect one
- Be strict when sending, but tolerant when receiving
  - Generate packets that comply with the standards, but expect to receive some that do not
- Think about scalability
- Take performance and cost into account

IP networks and the Internet: standardization
Figure: organization chart of the bodies involved: ISOC (Internet Society), IAB (Internet Architecture Board), IANA (Internet Assigned Numbers Authority), IESG (Internet Engineering Steering Group), IRSG (Internet Research Steering Group), IETF (Internet Engineering Task Force, organized in Areas, each containing Working Groups, WG), IRTF (Internet Research Task Force, organized in Research Groups, RG).

Standardization at the IETF
- Areas. Example: Transport
- Working Groups. Examples: tsvwg, sigtrans
- Basic principle: "rough consensus and running code"
- Participation in standardization decisions: open to everybody
  - 3 meetings per year
  - Mailing lists
  - Creation of working groups

Standardization at the IETF
- Free and open access to all the documents
- Documents
  - Standards: Request For Comments
    - Example: RFC 793 (TCP)
    - Permanent
  - Working documents: Internet Drafts
    - Example: draft-ietf-tsvwg-tcp-eifel-alg-07.txt (Eifel algorithm for TCP)
    - Ephemeral (valid for 6 months)

Standardization at the IETF
- Standardization process (standards track):
  - Submission of an individual draft, draft-untel-mon-sujet-favori-00.txt, to [email protected]
  - Discussion on the mailing lists and at the IETF meetings
    - Have the draft adopted as a working group item: draft-ietf-xxxwg-mon-sujet-favori-00.txt
    - Reach a consensus on the mailing list (last call)
    - Hand the document over to an Area Director
    - Last call in all the groups
    - If accepted: the document is sent to the RFC Editor (and to the IANA, if protocol values need to be allocated)
  - RFC: proposed standard, then draft standard and finally standard

RFCs
- Classes of RFC:
  - Documents produced by the standardization process (proposed standard, draft standard, standard)
  - Documents not produced by the standardization process
    - Experimental
    - BCP (Best Current Practice)
    - FYI (For Your Information)
    - ...
  - Pay attention to the date!
    - RFC 1149 (April 1, 1990): A Standard for the Transmission of IP Datagrams on Avian Carriers
    - RFC 2549 (April 1, 1999): IP over Avian Carriers with Quality of Service
    - RFC 3514 (April 1, 2003): the "evil packet" bit in the IP header

Evolution of the standards
- In general, one protocol is not described by a single RFC
- Example: TCP
  - RFC 793 (original specification)
  - RFC 1122 (Requirements for Internet Hosts)
  - RFC 1323 (Extensions for High Performance)
  - RFC 2018, 2883 (Selective Acknowledgment)
  - RFC 2581 (Congestion Control)
  - RFC 2988 (Retransmission Timer)
  - etc.

Research activities: the IRTF
- Longer-term vision
- Groups are sometimes closed
  - Membership is left to the choice of the chair
  - Public mailing lists
- Some examples
  - End-to-end (now closed)
  - Anti-spam
  - Internet Measurement
  - DTN (Delay/Disruption Tolerant Network)

Considerations about the layered model
- "Traditional layering my a**. When the present Internet architecture was constructed in the 1970's there was no OSI model at all. It would be revisionist (alas too common) to imagine that the *goal* of communications architecture is to fit into a frame (OSI) that was conceived ex post as merely an explanatory tool for the decisions about modularity that were made on far more serious grounds than a mere picture of a stack." (end2end, 11/05/2008)
- "If there is any revisionist history, Dave it is yours. I don't remember who said it, but when I refer traditional layering I am referring to the ideas we had of it prior to OSI. Traditional layering did predate the OSI model by probably 5 years or more. Bachman's 7 layers (OSI) were proposed in 78. I remember that by at least 72-73, it was common to talk about Physical, Data Link, Network and Transport Layers. In the ARPANet, we weren't sure what was above that." (end2end, 12/05/2008)


Course outline
- C1 - C8:
  - Internet model vs. OSI, standardization
  - The IP, ARP and ICMP protocols
  - The TCP, UDP and SCTP protocols
  - Addressing and address translation
  - Principles of routing protocols, the RIP protocol
  - Introduction to sockets and DNS
- PC1-2: dissection of TCP/IP frames
- PC3-4: IP addressing plan
- TP1: socket programming

Layer 3: (simplified) architecture
Layer         Protocols
Application   ping, traceroute
Transport     TCP, UDP
Network       ARP, IP, ICMP
Data link     Ethernet / IEEE

Internet: interconnection of networks

Interconnection of IP networks
Figure: two end systems (application, transport, IP, Ethernet), each attached to its own Ethernet network, communicate through two routers; each router runs IP over Ethernet on the LAN side and IP over PPP on the serial link that connects the two routers.

Internet Protocol (IP)
Layer         Protocols
Application   DHCP, ping, traceroute
Transport     TCP, UDP
Network       ARP, IP, ICMP
Data link     Ethernet / IEEE

Format of an IPv4 packet
The header is organized in 32-bit words; without options it is 20 bytes long.
Word 1: Version = 4 (4 bits) | Header length (4 bits) | Type of service (8 bits) | Total length in bytes (16 bits)
Word 2: Identification (16 bits) | Flags (3 bits) | Fragment offset (13 bits)
Word 3: Time to live, TTL (8 bits) | Protocol (8 bits) | Header checksum (16 bits)
Word 4: Source IP address (32 bits)
Word 5: Destination IP address (32 bits)
Then:   Options (if any) + padding
Then:   Data (of the upper layer / protocol)
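As an illustration of the header layout above, the following Python 3 sketch decodes the 20 fixed bytes of an IPv4 header with the standard `struct` module. The function name `parse_ipv4_header`, the dictionary keys and the sample header (with documentation addresses and a checksum left at 0) are assumptions made for the example, not part of the course material.

```python
import socket
import struct


def parse_ipv4_header(raw):
    """Decode the 20 fixed bytes of an IPv4 header (options are ignored)."""
    (ver_ihl, tos, total_len, ident, flags_frag,
     ttl, proto, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", raw[:20])
    return {
        "version": ver_ihl >> 4,
        "header_len": (ver_ihl & 0x0F) * 4,          # field counts 32-bit words
        "tos": tos,
        "total_len": total_len,
        "id": ident,
        "df": bool(flags_frag & 0x4000),             # don't fragment
        "mf": bool(flags_frag & 0x2000),             # more fragments
        "frag_offset": (flags_frag & 0x1FFF) * 8,    # field counts 8-byte units
        "ttl": ttl,
        "protocol": proto,
        "checksum": checksum,
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
    }


# Hypothetical header: 40-byte TCP packet, DF set, TTL 64, checksum not computed.
sample = struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 1, 0x4000, 64, 6, 0,
                     socket.inet_aton("192.0.2.1"),
                     socket.inet_aton("198.51.100.2"))

if __name__ == "__main__":
    print(parse_ipv4_header(sample))
```

The `!` in the format string selects network (big-endian) byte order, which is the order used on the wire.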

Length fields
- Header length (4 bits): in 32-bit words
  - Maximum size (with options) = 60 bytes, hence at most 40 bytes of options
- Total length (16 bits)
  - Theoretical maximum = 65535 bytes
  - IP over Ethernet: needed to tell the useful information apart from the padding
Figure: Ethernet frame = destination address | source address | protocol = IP (0x800) | IP packet | padding | CRC; the IP packet and the padding are passed up to the IP layer.

Type of service (TOS) field
- Initially (RFC 791, 795, 1349): packet priority and constraint-based routing
  - Priority = different treatment in the routers
    - 0 = lowest, 7 = highest priority
  - Best-effort traffic: TOS = 0
- Layout: Priority | Service (delay + throughput + reliability + cost) | 0

92 Type of service (TOS) field
- Current definition (RFC 2474, 3168): differentiated services (DiffServ) + congestion notification (ECN)
  - DSCP field:
    - 6 Class Selector values (compatibility with the former Priority field), including the classic best-effort treatment
    - 12 values for Assured Forwarding
    - 1 value for Expedited Forwarding
[Figure: layout of the byte: DiffServ Code Point (DSCP), then ECN.]
Fragmentation
- If the MTU of the link is too small to carry the whole packet, the packet is sent as fragments
- Reassembly is done only by the final destination
- Costly mechanism for the routers
Figure: [Tanenbaum, 2002]

93 Fragmentation: header fields
- Identification: unique number (for the sender)
  - If the packet is fragmented later on, all its fragments carry it
- Fragment offset: position of the first byte of the fragment in the original (unfragmented) datagram
  - Fragments are cut in multiples of 8 bytes
- DF (don't fragment) = 1: the packet must not be fragmented
  - If fragmentation is needed: the packet is discarded and an ICMP message is sent to the source
- MF (more fragments)
  - MF = 0: last fragment
- Default flags (unfragmented packet): DF = MF = 0
Field layout: Identification (16 bits) | 0 (1 bit) | DF (1 bit) | MF (1 bit) | Fragment offset (13 bits)
Fragmentation: example
[Figure: path E - R1 - R2 - D, with max. data = 4096, 1024 and 512 on the three successive links. E sends one packet carrying 2021 bytes of data. R1 splits it into two fragments of 1024 and 997 bytes. R2 splits these into fragments of 512 + 512 bytes and 512 + 485 bytes. For each packet the figure lists ID, DF, MF and offset.]
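The example above can be replayed with a few lines of Python. This is a sketch under simple assumptions: header bytes are ignored, only the data sizes quoted in the figure are used, and the function name fragment and its parameters are made up for the illustration.

```python
def fragment(data_len, max_data, base_offset=0, more_after=False):
    """Split data_len bytes into fragments of at most max_data bytes.

    Returns (offset_field, length, MF) tuples, offset_field being in 8-byte units.
    base_offset (in bytes) and more_after describe the packet being split, so
    that a fragment can itself be fragmented again by a later router.
    """
    step = (max_data // 8) * 8      # all fragments but the last are multiples of 8
    fragments = []
    sent = 0
    while sent < data_len:
        length = min(step, data_len - sent)
        is_last = sent + length == data_len
        mf = 0 if (is_last and not more_after) else 1
        fragments.append(((base_offset + sent) // 8, length, mf))
        sent += length
    return fragments

# R1 forwards onto a link with max. data = 1024, R2 onto a link with max. data = 512.
at_r1 = fragment(2021, 1024)
at_r2 = [piece
         for offset, length, mf in at_r1
         for piece in fragment(length, 512, base_offset=offset * 8, more_after=bool(mf))]
```

Here at_r1 holds the fragments of 1024 and 997 bytes, and at_r2 the four fragments of 512, 512, 512 and 485 bytes; only the very last one has MF = 0.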

94 Time to live (TTL) field
- Initialized to a value > 0
  - Typical value = 64
- Decremented by one:
  - Each time the packet goes through a router
  - Once per second while the packet is waiting for reassembly at the destination host
[Figure: hosts A and B; routing loop between R1, R2 and R3 (erroneous routing tables).]
Protocol field
[Figure: protocol stack with the demultiplexing values. Over Ethernet / SNAP: type = 0x806 for ARP, type = 0x800 for IP. Over IP: protocol = 1 for ICMP, protocol = 6 for TCP, protocol = 17 for UDP. Applications: DHCP, ping, traceroute.]

95 IP options
- For rarely used features, or features not foreseen in the original standard
- Variable-size header: performance problems
  - Typical implementation in a router:
    - A "fast" (= optimized) queue for packets without options
    - A "slow" queue (= costlier processing) for packets with options
- Maximum size of the options: 40 bytes
  - Can be limiting for some options
IP options
- With or without parameters
- Format of an option:
  - Copied = must be present in all the fragments of a packet
  - Class = 00 (control) or 10 (measurement)
  - Number = ID of the option
  - If there are parameters, a Length field is present
Field layout: copied (1 bit) | class (2 bits) | number (5 bits) | total length (8 bits) | parameters (variable)

96 Some options
- EOOL (End Of Option List): end of the list of options
- NOP (No OPeration): null option used to align options on 32-bit words
- LSR (Loose Source Route): suggests the route to follow
  - Parameters: list of router addresses
- SSR (Strict Source Route)
- RR (Record Route): routers add their address to the Parameters field
  - Can be used by the ping command
  - Maximum number of recorded routers = 9
- RTRALT (Router Alert): used e.g. by RSVP (a signaling protocol) to request processing by the router
IP addresses
- IP address: unique identification number with global scope
  - The addressing scheme must scale
- On 32 bits: (in theory) 2^32 ≈ 4 × 10^9 different addresses
  - Fixed size to simplify routing decisions
    - The address is processed packet by packet (= datagram mode)
    - If the address size is too small: problems

97 IP addresses (continued)
- Addresses are assigned according to rules
  - With a random choice:
    - Duplicates? (uniqueness of the address)
    - Routing?
  - These rules must be efficient
    - The address space is finite, so it should not be wasted
IP addresses (continued)
- Addresses are unique per network interface, not per machine
  - Router: several interfaces (one per network)
  - Host: (typically) one (active) interface
[Figure: hosts PC1 to PC5 attached to two networks, Res1 and Res2, each numbered with /24 prefixes.]

98 Special IP addresses
- Loopback: no packet is generated on the network
- Broadcast: reach all the machines in a local network
  - An IP broadcast maps to a MAC broadcast
  - Also: broadcast (Unix BSD)
Figure: [Tanenbaum, 2002]
IP addresses and netmask
- IP address = prefix + identifier
  - Prefix:
    - Designates the IP network to which the interface is attached
      - One IP network = one layer-2 network (e.g. Ethernet)
      - Between two networks there must be an IP router
    - Common to all the interfaces on this network
  - Identifier: designates the interface on this network
- Prefix length: a.b.c.d/lp
  - lp gives the number of bits of the prefix
- Other representation of the prefix: address & netmask
  - The number of bits set to 1 gives the length of the prefix
[Figure: a 32-bit address split into prefix and identifier; the split point is variable and given by the netmask.]
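The relation between address, netmask, prefix and identifier can be checked with the ipaddress module of the Python standard library. The interface address 192.0.2.77/26 is a hypothetical value chosen in a documentation range; it does not come from the slides.

```python
import ipaddress

# Hypothetical interface address with a prefix length of 26 bits.
iface = ipaddress.ip_interface("192.0.2.77/26")

netmask = iface.network.netmask              # 26 bits set to 1
prefix = iface.network.network_address       # what the library reports as the prefix
host_id = int(iface.ip) & int(iface.network.hostmask)

# The same prefix computed by hand: address & netmask.
manual_prefix = ipaddress.ip_address(int(iface.ip) & int(netmask))
prefix_length = bin(int(netmask)).count("1")
```

With these values the netmask is 255.255.255.192, the prefix is 192.0.2.64 and the identifier of the interface on that network is 13.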

99 IP addresses: notation
- 4 blocks of 1 byte, separated by a dot
- Decimal representation
- Conversion between IP addresses and names: DNS (Domain Name System), e.g. for morrocoy.irisa.fr
- Short notation for the network: prefix / number of bits set to 1 in the netmask
  - Example: a prefix followed by / 26
IP addresses and netmask: example
- Which part corresponds to the network and which part to the host?
Figure: [Toutain, 2003]

100 IP address classes (before 1994)
- 126 class A networks of ≈ 16 × 10^6 machines each
- About 16 × 10^3 class B networks of ≈ 65 × 10^3 machines each
- About 2 × 10^6 class C networks of ≈ 250 machines each
Figure: [Tanenbaum, 2002]
Example
- The prefix [...] is assigned to your company
  - What would be the default class of this network?
  - Your system administrator uses prefixes of length 26 inside this network: how many subnets could be defined?
  - What would be the broadcast address on subnet 3?
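Since the prefix of the exercise is not legible in this transcript, the sketch below works on a hypothetical block, 192.0.2.0/24, to show how such questions can be answered mechanically. The function classful_class is an illustrative helper, not a library call.

```python
import ipaddress

def classful_class(address: str) -> str:
    """Pre-1994 class of an address, derived from its first byte."""
    first = int(address.split(".")[0])
    if first < 128:
        return "A"      # leading bit 0
    if first < 192:
        return "B"      # leading bits 10
    if first < 224:
        return "C"      # leading bits 110
    return "D/E"        # multicast and reserved

block = ipaddress.ip_network("192.0.2.0/24")        # hypothetical company prefix
subnets = list(block.subnets(new_prefix=26))        # split into /26 subnets
broadcasts = [str(s.broadcast_address) for s in subnets]
```

For this block the class is C, four /26 subnets fit in the /24, and their broadcast addresses end in .63, .127, .191 and .255. Which of them is "subnet 3" depends on whether subnets are numbered from 0 or from 1.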

101 IP addressing (before 1994)
- Flat address space
  - No hierarchical numbering
  - No relation between address and geographical location: simplicity of administration is favored
    - Example: three /16 prefixes assigned to IntelliCorp (United States), INRIA (France) and Agere Systems (United States)
  - Centralized administration: NIC (Network Information Center)
- Classes A, B, C: inefficient and inflexible use of the addresses
- Evolution:
  - CIDR
  - Private addresses + NAT
  - IPv6 (128-bit addresses)
IPv6: a simplified header
[Figure: IPv6 header drawn 32 bits wide: Ver., Traffic Class, Flow label; Payload length, Next Header, Hop Limit; Source Address; Destination Address. Labels on the figure: 5 words, 40 Bytes.]

102 Towards layer 3
[Figure: protocol stack. Application: DHCP, ping, traceroute. Transport: TCP, UDP. Network: ARP, IP, ICMP. Link: Ethernet / IEEE.]
ARP protocol
- Address Resolution Protocol (RFC 826)
- Mapping from a network (IP) address to a MAC address
  - Applications only handle IP addresses
    - Within an IP subnet: addresses are assigned according to certain rules
  - Frames are exchanged using MAC addresses
    - Within an IP subnet: the numbering is random
- The MAC address of the destination must be known in order to send it a frame
- Checking that an address is unique
  - The IP address must be unique
    - The prefix part ensures that it is not used in another network
    - It remains to make sure that it is unique in the local network

103 ARP protocol (continued)
- Dynamic mapping table (cache)
  - Built and updated by the system
  - Each entry has a finite lifetime
morrocoy[15:24]% arp -a
default-gw.irisa.fr ( ) at 0:4:80:13:69:0
air.irisa.fr ( ) at 8:0:20:89:58:95
sky.irisa.fr ( ) at 8:0:20:ac:44:3
cuvert1.irisa.fr ( ) at 8:0:11:13:99:e5
ARP protocol (continued)
- If the address of the destination is not in the table, an ARP request is sent in a broadcast Ethernet frame
[Figure: the sender broadcasts an ARP request ("who has this IP address?"); the destination that owns the IP address answers with a point-to-point ARP reply ("me", together with its MAC address).]

104 ARP protocol (continued)
- When an IP address is configured, an ARP request is sent in a broadcast Ethernet frame
[Figure: while configuring an IP address, the source broadcasts an ARP request ("who has this IP address?"). A machine already using that address (conflict) answers with a point-to-point ARP reply ("me", together with its MAC address): the address is already in use.]
ARP protocol (continued)
- When an IP address is configured, an ARP request is sent in a broadcast Ethernet frame
[Figure: while configuring an IP address, the source broadcasts an ARP request ("who has this IP address?"). No reply arrives within 1 s: OK, the address can be used.]

105 ARP protocol (continued)
- Updating the table
morrocoy[15:44]% ping
PING ( ): 56 data bytes
...
morrocoy[15:45]% arp -a
default-gw.irisa.fr ( ) at 0:4:80:13:69:0
air.irisa.fr ( ) at 8:0:20:89:58:95
sky.irisa.fr ( ) at 8:0:20:ac:44:3
cuvert1.irisa.fr ( ) at 8:0:11:13:99:e5
juventus.irisa.fr ( ) at 0:3:ba:d:b6:ce
ARP packet (for Ethernet and IP)
Ethernet header:
- Destination MAC address (FF-FF-FF-FF-FF-FF for an ARP request)
- Source MAC address
- Protocol (= 0x806)
ARP packet (28 bytes):
- Type of MAC address (Ethernet = 1)
- Type of network address (IP = 0x800)
- Length of a MAC address (= 6)
- Length of an IP address (= 4)
- Code (1 = ARP request, 2 = ARP reply)
- Source MAC address
- Source IP address
- Destination MAC address ([...] for an ARP request)
- Destination IP address
(+ padding + CRC)
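As an illustration of the 28-byte layout, here is a sketch that packs an ARP request with Python's struct module. The function name, the MAC address and the IP addresses are hypothetical. Filling the unknown destination MAC address with zeros is a common convention assumed here, not something read from the slide.

```python
import socket
import struct

def build_arp_request(src_mac: bytes, src_ip: str, target_ip: str) -> bytes:
    """Pack the 28-byte ARP payload for Ethernet and IPv4 (no Ethernet header)."""
    return struct.pack(
        "!HHBBH6s4s6s4s",
        1,                          # type of MAC address: Ethernet
        0x0800,                     # type of network address: IP
        6,                          # length of a MAC address
        4,                          # length of an IP address
        1,                          # code: 1 = request, 2 = reply
        src_mac,
        socket.inet_aton(src_ip),
        b"\x00" * 6,                # destination MAC unknown (assumed zero filler)
        socket.inet_aton(target_ip),
    )

# Hypothetical sender 02:00:00:00:00:01 asking who has 192.0.2.2.
request = build_arp_request(bytes.fromhex("020000000001"), "192.0.2.1", "192.0.2.2")
```

To send it on a real network, this payload would still have to be placed in an Ethernet frame with protocol 0x806 and the broadcast destination address.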

106 ICMP protocol
[Figure: protocol stack. Application: DHCP, ping, traceroute. Transport: TCP, UDP. Network: ARP, IP, ICMP. Link: Ethernet / IEEE.]
ICMP protocol
- Internet Control Message Protocol (RFC 792)
  - Control protocol used to exchange control information related to the operation of IP
  - Also works with IPv6
- Purpose: exchange of error messages and of information requests
  - Handled either by IP or by an upper layer
- Layer 3, but encapsulated in IP packets
  - Protocol field = 1
[Figure: IP packet = IP header (20 bytes) + ICMP message (variable length).]

107 Format of ICMP messages
Field layout: Type | Code | Checksum, followed by variable content that depends on the message type and code (for an error message: + IP header + first 8 bytes of data of the IP packet that triggered the message)
- The (Type, Code) pair identifies the class of the message
- Checksum: computed as for the IP header
Some ICMP message types
  Type | Code | Description                          | Class
  0    | 0    | Echo reply                           | query
  3    |      | Destination unreachable:             |
       | 0    | Network unreachable                  | error
       | 1    | Host unreachable                     | error
       | 2    | Protocol unreachable                 | error
       | 3    | Port unreachable                     | error
       | 4    | Fragmentation needed but DF bit = 1  | error
       |      | etc.                                 |
  8    | 0    | Echo request                         | query
  11   |      | The time to live (TTL) reached 0:    |
       | 0    | In transit                           | error
       | 1    | During reassembly                    | error

108 ping command
- Based on ICMP messages of type 8 (echo request) and 0 (echo reply)
  - Receiving a type 8 message triggers the emission of a type 0 message
- Message format
  Type (0 or 8) | 0 | Checksum
  Identifier | Sequence number
  (optional data)
  - The reply contains a copy of the Identifier and Sequence number fields and of the optional data
ping: example
morrocoy[19:07]% ping
PING violin.sonycsl.co.jp ( ): 56 data bytes
64 bytes from : icmp_seq=0 ttl=39 time= ms
64 bytes from : icmp_seq=1 ttl=39 time= ms
64 bytes from : icmp_seq=2 ttl=39 time= ms
64 bytes from : icmp_seq=3 ttl=39 time= ms
64 bytes from : icmp_seq=4 ttl=39 time= ms
64 bytes from : icmp_seq=5 ttl=39 time= ms
64 bytes from : icmp_seq=6 ttl=39 time= ms
64 bytes from : icmp_seq=7 ttl=39 time= ms
^C
--- violin.sonycsl.co.jp ping statistics
packets transmitted, 8 packets received, 0% packet loss
round-trip min/avg/max = / / ms
Annotations on the output: icmp_seq is the sequence number, time is the round-trip time.
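The message format used by ping, together with the checksum rule (one's complement sum of 16-bit words, as for the IP header), can be sketched as follows. The function names are illustrative, and nothing is sent on the network: the code only builds and checks the bytes.

```python
import struct

def internet_checksum(data: bytes) -> int:
    """One's complement of the one's complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole number of words
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:                       # fold the carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def build_echo(msg_type: int, identifier: int, sequence: int, payload: bytes) -> bytes:
    """Build an ICMP echo request (type 8) or echo reply (type 0)."""
    header = struct.pack("!BBHHH", msg_type, 0, 0, identifier, sequence)
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", msg_type, 0, checksum, identifier, sequence) + payload

def build_echo_reply(request: bytes) -> bytes:
    """Copy identifier, sequence number and optional data into a type 0 message."""
    _type, _code, _checksum, identifier, sequence = struct.unpack("!BBHHH", request[:8])
    return build_echo(0, identifier, sequence, request[8:])

request = build_echo(8, 0x1234, 1, b"abcdefgh")
reply = build_echo_reply(request)
```

A receiver can validate a message by running internet_checksum over the whole message, checksum field included: the result is 0 when nothing was corrupted.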

109 traceroute program
- Based on ICMP messages of type 11 / code 0 (time exceeded) and type 3 / code 3 (port unreachable)
  - UDP datagrams are sent in order to trigger these ICMP error messages
- Format of these messages
  Type | Code | Checksum
  0
  IP header (including the options) + first 8 bytes of the original IP packet (the UDP header)
traceroute: how it works
- Direction A to B: UDP datagrams (with destination port numbers starting around 33434)
[Figure: A sends probes towards B through R1, R2 and R3. The probe with TTL = 1 dies at R1, which returns ICMP time exceeded. The probe with TTL = 2 dies at R2, and the probe with TTL = 3 at R3, each returning ICMP time exceeded. The probe with TTL = 4 reaches B, which returns ICMP port unreachable.]
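The TTL mechanism behind traceroute can be simulated without touching the network. This is a toy model under obvious assumptions: the path is a plain list of names, every router decrements the TTL by exactly one, and no probe is lost. The names probe and traceroute are only labels for this sketch.

```python
def probe(path, ttl):
    """Return (responder, ICMP message) for one UDP probe sent with the given TTL.

    path lists the routers in order, followed by the destination host.
    """
    for router in path[:-1]:
        ttl -= 1                      # each router decrements the TTL
        if ttl == 0:
            return router, "time exceeded (type 11, code 0)"
    # The probe reached the destination, where no one listens on the UDP port.
    return path[-1], "port unreachable (type 3, code 3)"

def traceroute(path, max_hops=30):
    """Send probes with TTL = 1, 2, ... until the destination answers."""
    hops = []
    for ttl in range(1, max_hops + 1):
        responder, message = probe(path, ttl)
        hops.append((ttl, responder, message))
        if responder == path[-1]:
            break
    return hops

hops = traceroute(["R1", "R2", "R3", "B"])
```

The list hops reproduces the figure: R1, R2 and R3 answer in turn with time exceeded, and B answers the fourth probe with port unreachable.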

110 traceroute: example
morrocoy[22:45]% traceroute -n violin.sonycsl.co.jp
traceroute to violin.sonycsl.co.jp ( ), 30 hops max, 40 byte packets
[Output: one line per hop, giving the TTL, the router on the route followed by the packets and three round-trip times in ms. An asterisk marks a probe with no reply (= no ICMP message) after 5 s.]
Path MTU
- Discovery of the maximum packet size along the route from A to B, in order to avoid fragmentation
  - Packets are sent with the DF bit = 1 (IP header)
  - If a router has to fragment, it returns an ICMP error message to the source:
  Type = 3 | Code = 4 | Checksum
  0 | Required MTU
  IP header (including the options) + first 8 bytes of the original IP packet

111

112 Course outline
- Lectures C1 - C8
  - Internet model vs. OSI, standardization
  - The IP, ARP and ICMP protocols
  - The TCP, UDP and SCTP protocols
  - Addressing, auto-configuration and address translation
  - Principles of routing protocols, the RIP protocol
  - Introduction to sockets and DNS
- PC1-2: disassembling TCP/IP frames
- PC3-4: IP addressing plan
- TP1: socket programming
Objectives of this part
- Understand the issues addressed by transport protocols
  - Flow control
  - End-to-end management of:
    - Reliability
    - Congestion

113 Outline: User Datagram Protocol
[Figure: protocol stack. Application: DHCP, ping, traceroute. Transport: TCP, UDP. Network: ARP, IP, ICMP. Link: Ethernet / IEEE.]
UDP protocol
- User Datagram Protocol (RFC 768)
- Applications/protocols that use UDP
  - NFS (Network File System)
  - DNS (Domain Name System)
  - DHCP (Dynamic Host Configuration Protocol)
  - Multimedia applications (interactive voice and video, streaming)

114 What does UDP offer to applications?
- Minimal support at the transport level
  - No retransmission of lost packets, no flow control, no congestion control
- A datagram service
  - Connectionless
    - No connection, hence well suited to multicast
  - Unreliable
    - If needed, reliability can be implemented at the application level
UDP datagrams and IP packets
- Unit of information (PDU) = datagram
- No splitting into segments
  - One write operation by the application = one UDP datagram
  - Risk of fragmentation by the IP layer (at the sender)
[Figure: IP packet = IP header + UDP datagram (UDP header + data).]

115 Header of a UDP datagram
Field layout (each row is 32 bits wide; the header is 8 bytes):
  Source port number | Destination port number
  Length | Checksum
  Data (if any)
- Port numbers: used for demultiplexing
  - The numbers are independent of those of TCP
- Length (≥ 8): UDP header + data
  - Redundant information (since: UDP data length = IP data length - 8)
Checksum computation
- UDP checksum: optional (mandatory for TCP)
  - If not computed: checksum = 0
- Method: the same as for TCP
  - One's complement sum of 16-bit words
  - Header + data + "pseudo-header" (IP)
Pseudo-header: Source IP address | Destination IP address | 0 | Protocol (= 17) | UDP length
UDP header: Source port number | Destination port number | UDP length | Checksum
Data, followed by a padding byte if needed (only for the computation)
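The checksum computation over pseudo-header, header and data can be written in a few lines of Python. The helper names and the sample addresses and ports are assumptions made for the illustration. The replacement of a computed value of 0 by 0xFFFF follows RFC 768, since 0 is reserved for "checksum not computed".

```python
import socket
import struct

def internet_checksum(data: bytes) -> int:
    """One's complement of the one's complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                    # padding, only for the computation
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def pseudo_header(src_ip: str, dst_ip: str, udp_length: int) -> bytes:
    """Source address, destination address, zero byte, protocol 17, UDP length."""
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, 17, udp_length))

def build_udp_datagram(src_ip, dst_ip, src_port, dst_port, payload: bytes) -> bytes:
    length = 8 + len(payload)              # UDP header + data
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    checksum = internet_checksum(pseudo_header(src_ip, dst_ip, length) + header + payload)
    if checksum == 0:
        checksum = 0xFFFF                  # 0 would mean "no checksum computed"
    return struct.pack("!HHHH", src_port, dst_port, length, checksum) + payload

# Hypothetical datagram from 192.0.2.1:5000 to 192.0.2.2:53.
datagram = build_udp_datagram("192.0.2.1", "192.0.2.2", 5000, 53, b"hello")
```

The receiver repeats the sum over the pseudo-header and the received datagram, checksum included, and accepts the datagram when the result is 0.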

116 Outline: Transmission Control Protocol
[Figure: protocol stack. Application: DHCP, ping, traceroute. Transport: TCP, UDP. Network: ARP, IP, ICMP. Link: Ethernet / IEEE.]
Outline
- Characteristics of TCP
- Protocol state machine
- Flow control
- Congestion control
- TCP performance
- Evolution of TCP

117 Introduction to TCP (Transmission Control Protocol)
- Examples of applications/protocols that use TCP
  - HTTP (web), SMTP / POP / IMAP (e-mail), FTP, telnet, ssh...
- Share of TCP traffic in the Internet (McCreary and Claffy, 2000)
  - 91% of the bytes
  - 83% of the packets
- Position in a layered protocol model
  - OSI model: layer 4 (transport)
  - "Internet" model:
[Figure: Application (user space) on top of TCP, UDP and SCTP, themselves on top of IP (all four in the kernel of the operating system), then Link and Physical.]
Definition of the protocol: standards
- IETF documents
  - RFC 793 (base specification)
  - RFC 1122 (Requirements for Internet Hosts)
  - RFC 2581 (Congestion Control)
  - RFC 2988 (Retransmission Timer)
  - RFC 1323 (Extensions for High Performance)
  - RFC 2018, 2883 (Selective Acknowledgment)
- De facto standard: the BSD implementation
  - Several thousand lines of C code

118 What does TCP offer to applications?
- A transparent, error-free, bidirectional (full-duplex) channel that carries a sequence of bytes
  - End-to-end and point-to-point protocol
  - Reliable service
  - Byte stream
  - Connection-oriented service
End-to-end and point-to-point
[Figure: two end systems, a web server and a web client, speak HTTP at the application level. TCP runs end to end between the two end systems. IP runs hop by hop through a router (node). Below IP, Ethernet is used on the Ethernet LAN and PPP on the serial link.]

119 Reliable service
- TCP assumes that the lower layer (IP) is unreliable
  - Acknowledgement of the data received
  - Retransmission of lost data
  - Flow control
  - Reordering of data that arrive out of order
  - Discarding of duplicate data
  - Check of data integrity (checksum)
Byte stream
- TCP carries bytes
  - "Unstructured" stream: TCP may split blocks of application data into segments
    - Interpreting / delimiting these bytes is up to the application, not to TCP
  - Each byte sent has its own sequence number
    - Numbering the data sent makes it possible to acknowledge (ACK) the data received

120 TCP segments and IP datagrams
- Unit of information (PDU) = segment
  - The application data are cut into blocks, carried in TCP segments
  - Each TCP segment is "encapsulated" in an IP datagram
[Figure: IP datagram = IP header + TCP segment (TCP header + data).]
Header of a TCP segment
Field layout (each row is 32 bits wide; 20 bytes without options):
  Source port number | Destination port number
  Sequence number
  Acknowledgement (ACK) number
  Header length | Reserved | ECN | Flags | Window size
  Checksum | Urgent data pointer
  TCP options (if any)
  Data (if any)

121 Port numbers
- Well known
  - From 0 to 1023
  - Standard services
    - Examples: telnet server (23), ssh (22), http (80)...
- Reserved
  - From 1024 to [...]
    - Example: Quake (26000)
- "Ephemeral"
  - From [...] to [...]
  - Allocated dynamically by the application
    - Example: ssh client
Port numbers and "demultiplexing"
- Identification = (IP address of A, port of A) + (IP address of B, port of B) = socket A + socket B
  - Example: two connections "in parallel" to an ssh server
    - Server: well-known port (22)
    - Client: ephemeral ports
[Figure: on the client, two ssh processes use two different ephemeral ports over TCP and IP; on the server, two sshd processes both use port 22.]

122 Flags of the TCP header
- Used for signaling
  SYN: synchronize the sequence numbers at both ends of the connection
  FIN: the TCP that sends the segment has no more data to send
  ACK: the "Acknowledgement number" field contains a valid value
  RST: "reset" (abort) the connection
  PSH: the TCP that receives the segment must pass the data to the application as quickly as possible
  URG: the "Urgent data pointer" field is valid
Sequence and acknowledgement numbers
- Sequence number = first data byte of the segment
- Acknowledgement number = next byte that the sender of the ACK expects to receive
- Full-duplex connection
  - Each end of the connection keeps track of the sequence number for each direction
[Figure: A sends sequence number = X with N bytes of data [X, X+N-1]; B answers with ACK, acknowledgement number = X+N; A then sends sequence number = X+N with M bytes of data [X+N, X+N+M-1].]

123 Acknowledgement (ACK) mechanism
- Positive acknowledgement
  - The receiver acknowledges what it has received in sequence
    - It does not spontaneously report that data are missing
    - It does not explicitly indicate what is missing
- "Cumulative" acknowledgement
  - One ACK can acknowledge more than one received segment
  - Delayed ACKs mechanism: send e.g. 1 ACK for every 2 segments received
[Figure: positive acknowledgement: segments [X, X+N-1] and [X+N, X+N+M-1] are acknowledged one by one, the second with ACK X+N+M. Cumulative acknowledgement: segments [Y, Y+J-1] and [Y+J, Y+J+K-1] are acknowledged by a single ACK Y+J+K.]
Retransmission and acknowledgement of TCP segments
[Figure: the application writes N, M and K bytes, sent as segments with seq = X, seq = X+N and seq = X+N+M. The first segment is lost. The two others arrive out of sequence, are stored, and each triggers ACK X. When the RTO expires, the sender retransmits seq = X. The receiver then acknowledges everything with ACK X+N+M+K and passes the N+M+K bytes to the application, in order.]
- Reliable service = retransmission in case of loss
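The receiver side of the retransmission scenario can be sketched as a small class. It is a simplification made for this example: segments never overlap, the window is unlimited, and an ACK is returned for every segment (no delayed ACKs). The class name and the numeric values are hypothetical.

```python
class Receiver:
    """Cumulative-ACK receiver that stores out-of-sequence segments."""

    def __init__(self, initial_seq: int):
        self.next_expected = initial_seq
        self.out_of_order = {}        # sequence number -> data
        self.delivered = b""          # bytes passed to the application, in order

    def receive(self, seq: int, data: bytes) -> int:
        """Process one segment and return the acknowledgement number to send."""
        if seq == self.next_expected:
            self.delivered += data
            self.next_expected += len(data)
            # Stored segments that are now in sequence are delivered as well.
            while self.next_expected in self.out_of_order:
                chunk = self.out_of_order.pop(self.next_expected)
                self.delivered += chunk
                self.next_expected += len(chunk)
        elif seq > self.next_expected:
            self.out_of_order[seq] = data
        # Duplicates (seq < next_expected) are simply discarded.
        return self.next_expected

# X = 1000, N = 100, M = 50, K = 25; the first segment is lost, then retransmitted.
receiver = Receiver(1000)
acks = [
    receiver.receive(1100, b"b" * 50),    # out of sequence, stored
    receiver.receive(1150, b"c" * 25),    # out of sequence, stored
    receiver.receive(1000, b"a" * 100),   # retransmission fills the hole
]
```

The first two ACKs repeat X = 1000, and the last one jumps to X+N+M+K = 1175, at which point all 175 bytes have been delivered in order.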

124 Outline
- Characteristics of TCP
- Protocol state machine
  - Opening a connection
  - Closing a connection
- Flow control
- Congestion control
- TCP performance
- Evolution of TCP
Connection-oriented service
- Before data can be sent, a connection must be established
  - Signaling mechanism: three-way handshake
    - Analogy: telephone call
- Typical phases of a TCP connection
  - Establishment
  - Data exchange
  - Closing

125 Opening the connection
[Figure: the client application performs an active open, the server application a passive open. The client sends SYN j; the server answers SYN k, ACK j+1; the client answers ACK k+1. The connection is then established on both sides.]
- Three segments to open the connection: three-way handshake
Closing the connection
[Figure: the client application performs an active close and the client sends FIN m; the server answers ACK m+1 and notifies its application. After the passive close by the server application, the server sends FIN n; the client answers ACK n+1. Each side notifies its application of the close.]
- Four segments to close the connection

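126 "Half-closed" connection
[Figure: the client performs an active (unilateral) close and sends FIN; the server acknowledges the FIN and notifies its application of the close. The server application keeps writing data, which the client application reads and the client acknowledges. Later the server sends its own FIN (passive close); the client acknowledges it and notifies its application.]
- TCP connection: full-duplex (the data flow in one direction is independent of the other direction)
[Stevens, 1994]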

127 Outline
- Characteristics of TCP
- Protocol state machine
- Flow control
  - Sliding window
- Congestion control
- TCP performance
- Evolution of TCP
Flow control in TCP
- End to end
- Goals
  - Avoid overwhelming the receiver
  - Make the best use of the capacity of the network
    - This last goal may not always be met

128 Flow control in TCP
- "Sliding window"
  - Idea:
    - The sender may send data without having received an ACK for everything it has already sent (efficient use of the capacity), as long as the receiver can absorb the new data (flow control)
  - Principle: each TCP advertises the number of bytes it is willing to receive, counted from the ACK number
[Figure: the sender's view of the byte stream, from left to right: sent and acknowledged | sent but not acknowledged | can be sent without waiting | cannot be sent yet. The window advertised by the receiver covers the two middle regions; the usable window is the third region.]
Sliding window: evolution over time
- The receiver acknowledges the arrival of data: the window closes
  - Header: ACK number field (32 bits)
- The receiver reads (acknowledged) data: the window opens
  - Header: Window size field (16 bits)
- Closing + opening = the window moves "to the right" (slides)
[Figure: send window that closes on one edge and opens on the other.]

129 Flow control with a sliding window
- Intuitively:
  - The higher the bandwidth, and/or
  - The longer the round-trip time (RTT),
  - the larger the window must be for the sender to be able to transmit "continuously"
Outline
- Characteristics of TCP
- Protocol state machine
- Flow control
- Congestion control
  - Algorithms: Slow start, Congestion avoidance, Fast retransmit, Fast recovery
- TCP performance
- Evolution of TCP
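This intuition corresponds to the bandwidth-delay product: to transmit continuously, the window must hold at least bandwidth × RTT worth of data. The short computation below is a worked example with made-up link parameters; the function name is illustrative.

```python
def min_window_bytes(bandwidth_bps: int, rtt_ms: int) -> int:
    """Bytes that must be in flight to keep the path full: bandwidth x RTT."""
    return bandwidth_bps * rtt_ms // 8000     # (bits/s x ms) converted to bytes

MAX_16BIT_WINDOW = 2**16 - 1                  # largest value of a 16-bit window field

slow_path = min_window_bytes(2_000_000, 20)       # 2 Mbit/s, RTT = 20 ms
fast_path = min_window_bytes(100_000_000, 50)     # 100 Mbit/s, RTT = 50 ms
```

The first path needs a window of 5000 bytes, which fits in the 16-bit Window size field. The second needs 625000 bytes, far more than the field can express on its own.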

130 Congestion
- Packets are stored:
  - In the links (BDP)
  - In the routers (queues)
- Example: low-rate intermediate link; window = 20
  - The spacing between ACKs is set by the slowest link
  - If the queue of R1 saturates, packets are lost
[Figure: sender, router R1 (bottleneck), a second router and receiver, with data flowing in one direction and ACKs in the other. [Stevens, 1994]]
About congestion
- The sliding window guarantees that the receiver is not flooded
- Problem: when the "bottleneck" is in the network, not in the receiver
  - Possible causes
    - Low-rate link
    - Arrival rate > capacity of the link
- (In principle) the state of all the intermediate nodes cannot be known

131 Congestion control in TCP
- Implemented by the "sender"
  - Mechanisms:
    - Detection of congestion
    - Reaction to congestion
- (Up to) four algorithms acting together
  - Slow start
  - Congestion avoidance
  - Fast retransmit
  - Fast recovery
- Send window: wnd = min(rwnd, cwnd)
  - rwnd := advertised window
    - Flow control driven by the receiver
  - cwnd := congestion window
    - Flow control driven by the sender
Congestion control in TCP: algorithms
- Slow start
  - Start sending "cautiously"
    - cwnd = 1 or 2 segments
    - At the beginning of the connection (state of the network unknown)
    - When the retransmission timer expires
  - Use the arrival rate of the ACKs to adapt the sending rate of the segments (self-clocking)
    - Lightly loaded network: the network responds quickly
    - Loaded network: the ACKs take longer to arrive, so the start is slow
- Congestion avoidance
  - Once congestion has occurred, (try to) prevent it from occurring again
  - While still using the bandwidth efficiently

132 Slow start
- Algorithm:
  - With each new ACK received: cwnd ← cwnd + 1
- In the absence of losses, cwnd grows (almost) exponentially
  - Growth rate: (about) × 2 per RTT
Evolution of cwnd: idealized version
[Figure: cwnd (in TCP segments) versus time (in RTTs), starting at 1, with ssthresh marked. Successive phases: slow start, network congestion, RTO interval, slow start + congestion avoidance, then fast recovery + congestion avoidance (twice).]
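The idealized curve can be approximated with a very small simulation, counted in segments and in RTTs. Only the slow-start rule and wnd = min(rwnd, cwnd) come from the slides. The linear growth of one segment per RTT in congestion avoidance and the halving of ssthresh after a timeout are the usual rules of RFC 2581, assumed here. Fast retransmit and fast recovery are left out.

```python
def simulate_cwnd(rtts, ssthresh, rwnd, timeouts=()):
    """Return the send window (in segments) used during each RTT."""
    cwnd = 1
    history = []
    for rtt in range(rtts):
        history.append(min(rwnd, cwnd))        # wnd = min(rwnd, cwnd)
        if rtt in timeouts:
            ssthresh = max(cwnd // 2, 2)       # remember half of the current window
            cwnd = 1                           # restart with slow start
        elif cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)     # slow start: about x 2 per RTT
        else:
            cwnd += 1                          # congestion avoidance: + 1 per RTT
    return history

# Hypothetical run: ssthresh = 8 segments, large rwnd, one timeout during RTT 6.
history = simulate_cwnd(10, ssthresh=8, rwnd=64, timeouts={6})
```

The run gives 1, 2, 4, 8 (slow start), then 9, 10, 11 (congestion avoidance), then 1, 2, 4 again after the timeout, with the new ssthresh set to 5.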

133 Evolution of cwnd: more realistic version
[Figure: cwnd versus time (in RTTs), showing network congestion events, an RTO and the successive values of ssthresh.]
Outline
- Characteristics of TCP
- Protocol state machine
- Flow control
- Congestion control
- TCP performance
- Evolution of TCP

134 SACK: Selective Acknowledgements
- Goal: improve the acknowledgement mechanism of TCP
- Idea: acknowledge not only the last data received in sequence, but also the data received out of sequence
  - The sender can better infer which data are missing, and so optimize retransmission
The future of TCP
- Many years still ahead of it?
  - Inertia due to the "critical mass" (installed base)
    - OSs
    - Applications
  - TCP is difficult to evolve
    - Conservatism
    - Interoperability
    - Interaction with middleboxes (NATs...)
    - Can one "break the Internet"?
  - Ideas from TCP have been reused for SCTP
    - Congestion control mechanisms, SACK
    - SCTP: a next-generation TCP?

135 Outline: Stream Control Transmission Protocol
[Figure: protocol stack. Application: DHCP, ping, traceroute. Transport: SCTP, TCP, UDP. Network: ARP, IP, ICMP. Link: Ethernet / IEEE.]
SCTP
- Stream Control Transmission Protocol (RFC 2960)
- Reliable, connection-oriented transport protocol
- Common points with TCP:
  - Point-to-point, full-duplex communication
  - SACK
  - Flow control and congestion control
    - ECN: optional

136 SCTP versus TCP
  TCP                                | SCTP
  Byte-oriented (byte-stream)        | Message-oriented
  Application data: a single stream  | Application data: N streams possible
  SACK optional                      | SACK mandatory
  Data numbered by bytes             | Data numbered by blocks (chunks)
TCP problem: SYN flooding
- A server is attacked by flooding it with connection requests
  - Receiving a SYN = creating a context in the server (= resources used)
[Figure: the attacker sends a burst of SYNs with a forged source IP address; the SYN+ACKs never reach the attacker; the memory of the server is exhausted.]

137 SCTP solution: cookies
- Four segments to open the connection (in SCTP jargon: "association")
  - Cookie: context information (encrypted)
  - The server creates the context only when it receives the cookie back
[Figure: the client sends INIT; the server answers INIT-ACK (+ cookie); the client sends COOKIE-ECHO (+ cookie); the server answers COOKIE-ACK.]
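The idea of a stateless cookie can be sketched with the hmac module of the Python standard library. This is not the SCTP wire format: the cookie layout, the function names and the client values are invented for the example, and the cookie is authenticated with an HMAC rather than encrypted, which is enough to show why the server needs no per-client memory before the cookie comes back.

```python
import hashlib
import hmac
import os

SECRET = os.urandom(32)        # key known only to the server
MAC_LENGTH = 32                # size of an HMAC-SHA256 tag

def make_cookie(client_addr: str, init_tag: int) -> bytes:
    """Sent with INIT-ACK: the context itself plus a tag; nothing is stored."""
    state = f"{client_addr}|{init_tag}".encode()
    tag = hmac.new(SECRET, state, hashlib.sha256).digest()
    return state + tag

def check_cookie(cookie: bytes):
    """On COOKIE-ECHO: rebuild the context only if the tag is genuine."""
    state, tag = cookie[:-MAC_LENGTH], cookie[-MAC_LENGTH:]
    expected = hmac.new(SECRET, state, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None                             # forged or corrupted cookie
    client_addr, init_tag = state.decode().rsplit("|", 1)
    return client_addr, int(init_tag)

cookie = make_cookie("192.0.2.10:5000", 42)     # hypothetical client
context = check_cookie(cookie)
forged = check_cookie(bytes([cookie[0] ^ 1]) + cookie[1:])
```

An attacker who floods the server with INITs from forged addresses never sees the INIT-ACK, so it cannot echo a valid cookie and no context is ever created for it.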

138 Course outline
- Lectures C1 - C8
  - Internet model vs. OSI, standardization
  - The IP, ARP and ICMP protocols
  - The TCP, UDP and SCTP protocols
  - Addressing and address translation
  - Principles of routing protocols, the RIP protocol
  - Introduction to sockets and DNS
- PC1-2: disassembling TCP/IP frames
- PC3-4: IP addressing plan
- TP1: socket programming
IP addressing (before 1994)
- Classful addressing
  - Flat address space
    - No hierarchical numbering, no relation between address and geographical location
  - Centralized administration: NIC (Network Information Center)
- Classes A, B, C: inefficient and inflexible use of the addresses
  - About 50% of the class B networks have < 50 machines (Tanenbaum, 2003)

139 IP addressing (before 1994)
- Exhaustion of the class B addresses
- Allocation of blocks of class C addresses
  - Allocate class B networks only when this is truly justified (RFC 1466)
[Table: total number of prefixes, allocated prefixes and % allocated for classes A, B and C (year 1993; source: RFC 1466).]
IP addressing (before 1994)
[?]

140 IP addressing (before 1994)
- Exhaustion of the class B addresses, hence a large number of class C addresses in use
  - One network = one line in a routing table, hence an "explosion" of the routing tables of the operators
    - Memory required in the routers
    - Cost of managing the tables (grows with the size of the table)
    - More information exchanged between routers, longer convergence times (routing protocols)
IP addressing (before 1994)
- A different allocation into classes?
  - E.g. class C with 10 bits for the machines
    - Fewer class C networks, more machines per network (1022 addresses/network versus 254)
    - But (still) no hierarchical numbering
  - E.g. class B with 20 bits for the networks
    - More class B networks, fewer machines per network (4094 addresses/network versus 65534)
    - But explosion of the routing tables and (still) no hierarchical numbering

141 Hierarchical addressing (post-1994)
- CIDR (Classless Inter-Domain Routing)
  - RFC 1518, 1519
- Idea:
  - Assign the remaining addresses in blocks of variable size, according to the needs
    - When CIDR started: allocation in the range of the former class C
  - Try to recover prefixes already allocated, e.g. in the class A range, and reallocate them following these rules
CIDR: principle
- Address = prefix + prefix length
  - The classes disappear: the first bits of the address have no relation to the "size" of the network
- Examples:
  - A former class A network is written with / 8 (short form: 17 / 8)
  - A former class B network is written with / 16
  - A former class C network is written with / 24
  - A / 16 block (of "class B" size) can be taken from the former class C range

142 Prefix allocation: administration
- Hierarchy of administrative bodies
  - Coordination: IANA (Internet Assigned Numbers Authority)
  - Regional Internet Registries (RIR)
    - APNIC (Asia Pacific Network Information Centre)
    - ARIN (American Registry for Internet Numbers): North America
    - LACNIC (Regional Latin-American and Caribbean IP Address Registry)
    - RIPE NCC (Réseaux IP Européens - Network Coordination Centre): Europe, Middle East, Central Asia
    - AFRINIC: Africa and Indian Ocean
Prefix allocation: administration
- The IANA assigns blocks of prefixes to the RIRs
- The RIRs allocate prefixes to the LIRs (Local Internet Registries)
- The LIRs (e.g. an ISP) distribute addresses to their customers
[Toutain]

143 Prefix allocation
[Figure: allocation tree. The IANA delegates 62/8, 80/7, 193/8 and 194/7 to RIPE-NCC. RIPE-NCC allocates a /16 to one operator and a /14 to another. Operators allocate /25, /24 and /21 blocks to sites, and a /16 to a further operator that serves its own site. [Toutain, 2003]]
Example of allocation by blocks
- The prefix leaves three bits free (xxx) before the /16 boundary = 8 blocks of size /16
morrocoy[10:27]% whois -h whois.ripe.net /13
% This is the RIPE Whois server.
...
inetnum:
netname: FR-TELECOM
descr: PROVIDER Local Registry
descr: France Telecom
country: FR
admin-c: AB5579-RIPE
tech-c: ML2808-RIPE
tech-c: PG5119-RIPE
status: ALLOCATED PA

144 Prefix allocation (continued)
[Figure: allocation tree from the IANA (62/8, 80/7, 193/8, 194/7) to RIPE-NCC, to operators (/16, /14) and to sites (/25, /24, /21). [Toutain, 2003]]
- The addresses belong to the operators
  - Changing operator = changing addresses
  - The customer is strongly tied to its operator
Prefix allocation (continued)
[Figure: same allocation tree, with a site holding a /22 below the third operator. [Toutain, 2003]]
- The addresses belong to the operators
  - ISP 3 is also tied to its operator (2)
  - Changing operator = forcing a change of addresses on all its customers

145 Prefix allocation (continued)
[Figure: allocation tree from IANA through RIPE NCC and the operators down to the sites. Toutain, 2003]
- Connecting to more than one operator ("multihoming") = loss of the addressing hierarchy

Prefix aggregation
- From the outside, operator 2 (and every network that depends on it) is seen as a single /14 prefix
  - Simpler routing tables
  - Europe seen as 3 or 4 prefixes?
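Aggregation can be sketched with `ipaddress.collapse_addresses`, which merges contiguous prefixes into the shortest covering list. The prefixes and the `aggregate` helper below are hypothetical and only illustrate the idea.

```python
import ipaddress


def aggregate(prefixes):
    """Merge contiguous prefixes into the shortest equivalent list."""
    networks = (ipaddress.ip_network(p) for p in prefixes)
    return [str(n) for n in ipaddress.collapse_addresses(networks)]


# Two adjacent /15 blocks held by the same operator collapse into one /14.
print(aggregate(["10.4.0.0/15", "10.6.0.0/15"]))

# A block that lies outside the operator's range cannot be merged and
# needs its own routing-table entry.
print(aggregate(["10.4.0.0/15", "10.6.0.0/15", "10.9.0.0/16"]))
```

The second call hints at why multihoming hurts aggregation: a prefix that does not fit into the operator's block has to be announced separately.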

146 Prefix aggregation (continued)
[Figure: operators A and B in the United States, operators and sites in Europe. Toutain, 2003]
- Except that, topologically, things are more complicated (there is no backbone network at the RIR level)

Special addresses
- Auto-configuration addresses: 169.254.0.0 to 169.254.255.255 (169.254.0.0/16)
- Private addresses
  - Will never be assigned
  - Recommended, e.g., for an IP network that is not connected to the Internet
  - 10.0.0.0 to 10.255.255.255 (10.0.0.0/8): former class A
  - 172.16.0.0 to 172.31.255.255 (172.16.0.0/12): former class B
  - 192.168.0.0 to 192.168.255.255 (192.168.0.0/16): former class C

147 Address translation: NAT
- Network Address Translation (RFC 3022)
  - A pool of addresses is shared within a LAN
  - The pool is smaller than the number of potential clients
    - Example: corporate network
- Network Address and Port Translation
  - A single IP address is shared among several machines in a LAN
    - Example: home network with N computers and a single address
- A short-term solution to the address shortage?
- In what follows we use the generic term NAT for both cases

NAT: operating principle
- Inside the LAN, each machine has a different IP address
  - Private addresses are used
    - Typically in the 10.0.0.0/8 range
- Outside, one (or a small number of) IP address(es) is used
  - Public address(es)
- LAN/Internet interface = NAT box
  - Dynamic conversion of the addresses in packets coming from / going to the Internet
  - Typical implementation (home setting): NAT + firewall (+ router (+ WiFi access point))

148 NAT: operating principle (continued)
[Figure: example of translation for an outgoing packet, from a private source address to a public source address. Drawing: Tanenbaum, 2002]

NAT: operating principle (continued)
- Intranet to Internet direction: unambiguous mapping
- Internet to intranet direction: uniqueness?
  - One possible (unrealistic!) solution: an IP option carrying the private address of the source
    - To work, it would have to be implemented everywhere in the Internet!
- To remove the ambiguity: port numbers of the transport protocol
  - Most IP packets carry a PDU of a transport protocol (TCP, UDP, or even SCTP and DCCP) that uses port numbers
  - Requires no modification to, and no option of, IP / TCP / the application

149 NAT: address mapping
- Example: two clients want to connect to a server on the Internet through a NAT
[Figure: two PCs on a LAN with private addresses, connected to the Internet through a NAT box]

NAT: address mapping (continued)
[Figure: IP header (src, dst, IP checksum) and TCP header (source port, destination port = 80, TCP checksum) of a packet on each side of the NAT. Mapping table columns: private source address, private source port, public source port]

150 NAT: address mapping (continued)
[Figure: IP header (src, dst, IP checksum) and TCP header (source port, destination port = 80, TCP checksum) of a second packet on each side of the NAT. Mapping table columns: private source address, private source port, public source port]
- New flow (= new TCP connection): a row is added to the mapping table

NAT: address mapping (continued)
- What about incoming connections?
  - Example: a server with a private address that must be reachable from the Internet
  - Solution: manual configuration of the NAT
[Figure: ftp server behind the NAT]
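The mapping table described above can be sketched as a toy NAPT box. This is a minimal illustration under stated assumptions, not how a real NAT is implemented: the class name `Napt`, the starting port 5000 and all addresses are hypothetical, and checksum rewriting, timeouts and protocols are ignored.

```python
from itertools import count


class Napt:
    """Toy NAPT box: one public address, one table row per flow."""

    def __init__(self, public_ip, first_port=5000):
        self.public_ip = public_ip
        self._ports = count(first_port)  # next free public port (assumption)
        self.out = {}    # (private ip, private port) -> public port
        self.back = {}   # public port -> (private ip, private port)

    def outgoing(self, src_ip, src_port):
        """Translate the source of an outgoing packet."""
        key = (src_ip, src_port)
        if key not in self.out:          # new flow: add a row to the table
            port = next(self._ports)
            self.out[key] = port
            self.back[port] = key
        return self.public_ip, self.out[key]

    def incoming(self, dst_port):
        """Find the private destination of an incoming packet, if any."""
        return self.back.get(dst_port)   # None: no row, packet is dropped


nat = Napt("203.0.113.1")
print(nat.outgoing("10.0.0.1", 3345))
print(nat.outgoing("10.0.0.2", 3345))
print(nat.incoming(5001))
```

Two clients using the same private source port get different public ports, which is what removes the ambiguity in the Internet to intranet direction. An incoming packet with no matching row is dropped, which is why a server behind the NAT needs a manually configured entry.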

151 NAT: advantages
- Saves IP addresses
  - Postpones the "end of the IPv4 world"
- Transparent to applications
- Transparent to the protocol stacks of the machines
- Private addresses = the access provider can be changed without changing the addressing plan

NAT: drawbacks / criticisms
- Violates the principle "one IP address = one unique machine"
- IP network = (pseudo) connection-oriented network?
  - The network (= the NAT box) keeps state (= address and port mapping) for each connection, created when the connection is established
    - Router crash: (in principle) transparent to connections
    - NAT crash: all connections are necessarily cut

152 NAT: drawbacks / criticisms (continued)
- Violates protocol layering
  - Layer k must not rely on what layer k + 1 put in its PDU
    - Example: size of the port numbers?
  - A new transport protocol requires modifying the NAT (SCTP?)

NAT: drawbacks / criticisms (continued)
- Applications that put IP addresses in the application data
  - Example: ftp
    - IP addresses are encoded in ASCII
    - The data must be parsed; the translated address may have fewer characters, so the headers (length, checksum...) must be modified
  - A new application requires modifying the NAT
- Delays the deployment of the real solution (= IPv6) to the address space problem?

153 Auto-configuration: the DHCP protocol
[Figure: protocol stack. Application: DHCP, ping, traceroute; Transport: TCP, UDP; Network: ARP, IP, ICMP; Link: Ethernet / IEEE]

Automatic configuration
- Obtain the essential IP parameters
  - IP address + prefix
  - Address of the default router
- Fleet management objective
  - If users are left to configure their own stations
    - A nightmare for network administrators
    - A source of errors that are hard to identify (e.g. netmask)
  - Security concerns
    - Control access to the network
  - Automatic service discovery
    - Name server
    - Printers, file servers
[Figure: a station on a /24 LAN obtains AddrIP, Netmask, GW, WINS addr, ...]

154 DHCP (Dynamic Host Configuration Protocol)
- Dynamic Host Configuration Protocol: RFC 1541, RFC 2131
  - Functions
    - IP address allocation
    - Configuration of the IP stack
    - Initialization of miscellaneous parameters
    - Retrieval of a boot file
  - Runs over UDP
- Evolution of BOOTP (Bootstrap Protocol): RFC 951, RFC 1532
  - Used to let a station boot over the network
  - Runs over UDP, uses TFTP

DHCPv4: address allocation
- Three types of address allocation
  - Automatic
    - Address chosen from a pool
    - Reserved for one MAC address
  - Manual
    - The network administrator fills in a configuration file
    - MAC address to IP address mapping
    - For security reasons
  - Dynamic
    - Address chosen from a pool
    - For a limited duration
    - Method used by ISPs
[Figure: IP address allocation: static, dynamic; manual, automatic, dynamic]

155 DHCPv4: deployment
- On the server
  - Enable the service
    - dhcpd daemon on Unix
    - Lease management
  - Configuration file
    - Stores information about the network
    - Stores information about each client
      - A client = <IP address, MAC address>
  - The server also stores the assigned addresses
    - The client may request other information
- On the client
  - Enable the service
    - A dhcpcd daemon on Unix
    - A checkbox in a configuration panel
  - Management of lease renewals

DHCPv4: operation
[Figure: a DHCP client and two DHCP servers, A and B, on a /23 LAN. Message 1: DISCOVER; message 3: REQUEST. The client obtains AddrIP, Netmask, GW, WINS addr, ...]

156 DHCPv4: messages
- Client requests
  - DISCOVER (1)
    - Request for an IP address allocation
    - List of parameters requested by the client (domain name, subnet mask, DNS, WINS server, ...)
  - REQUEST (3)
    - Reply to an OFFER message ("server identifier" option) or renewal of a lease
    - A server that was not chosen releases the parameters selected by the client
  - DECLINE (4)
    - Signals that the allocated address is already in use
  - RELEASE (7)
    - Releases an IP address

DHCPv4: messages (continued)
- Server replies
  - OFFER (2)
    - Reply from a server to a DISCOVER message
    - Contains the first parameters
  - ACK (5)
    - Acknowledgement; contains the parameters and the IP address allocated to the client
  - NAK (6)
    - Signals the expiry of a lease or the reception of bad network parameters from the client, for example when the client proposes an invalid name or IP address
- Note: the message types are encoded as a BOOTP option: option 53 (code 53, length 1, value 1 to 7)
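The encoding of the message type as option 53 (code 53, length 1, value 1 to 7) can be sketched as follows. The value numbering follows the slide above; the names `DhcpMessageType`, `encode_message_type` and `decode_message_type` are hypothetical helpers for this example.

```python
from enum import IntEnum


class DhcpMessageType(IntEnum):
    """Message types, numbered as on the slide."""
    DISCOVER = 1
    OFFER = 2
    REQUEST = 3
    DECLINE = 4
    ACK = 5
    NAK = 6
    RELEASE = 7


OPTION_MESSAGE_TYPE = 53


def encode_message_type(kind):
    """Build the 3-byte option: code 53, length 1, then the type value."""
    return bytes([OPTION_MESSAGE_TYPE, 1, kind])


def decode_message_type(option):
    """Parse a 3-byte message type option back into a DhcpMessageType."""
    code, length, value = option
    if code != OPTION_MESSAGE_TYPE or length != 1:
        raise ValueError("not a DHCP message type option")
    return DhcpMessageType(value)


print(encode_message_type(DhcpMessageType.DISCOVER).hex())
print(decode_message_type(bytes([53, 1, 5])).name)
```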

157 DHCPv4: message format
Fields of the DHCPv4 PDU:
- Msg type (1 = request, 2 = response)
- MAC addr type (Ethernet)
- MAC addr length (6 for Ethernet)
- Hop count
- Transaction ID
- Number of seconds
- Client IP addr
- Your IP addr
- Server IP addr
- Gateway IP addr
- Client MAC addr (16 bytes)
- Server name (64 bytes)
- Boot file name (128 bytes)
- Vendor-specific information (64 bytes)

DHCPv4: options (excerpt)
- Subnet Mask
- Time Offset: difference between the subnet's time and universal time
- Router: list of routers
- Name Server: list of name servers
- LPR Server: list of print servers
- Resource Location Server: list of resource location servers
- Host Name: name of the workstation
- Domain Name: name of the domain in use
- Default IP Time-to-Live
- Interface MTU
- Broadcast Address
- Router Solicitation Address: address to use for router solicitation messages
- Static Route

158 Automatic configuration in IPv6
- With IPv6, host configuration must be automatic (plug and play)
  - Uses ICMPv6 messages
  - Information is requested at start-up
    - Network prefix
    - Default router
    - ...
- Only routers need to be configured
  - Soon no longer necessary, thanks to prefix delegation
- The IPv6 address of a station is obtained automatically
  - But it is not registered in the DNS
  - Servers need a dynamic DNS
- But before discussing autoconfiguration, a few reminders about IPv6

The IPv6 case
[Figure: a router on a LAN with prefix P. Station B sends a Router Solicitation (RS) with IP destination = ff02::2 and IP source = its link-local address, then derives its address: f(P, MACb) => IPb]
- DHCPv6 for additional parameters


160 Course outline
- Lectures C1 to C8
  - Internet model vs. OSI, standardization
  - The IP, ARP and ICMP protocols
  - The TCP, UDP and SCTP protocols
  - Addressing and address translation
  - Principles of routing protocols, the RIP protocol
  - Introduction to sockets and DNS
- Exercise sessions and lab (PC1-2, PC3-4, TP1)
  - Dissecting TCP/IP frames
  - IP addressing plan
  - Socket programming

Course outline: routing protocols
[Figure: protocol stack. Application: OSPF, RIP, http, ping; Transport: UDP, TCP; Network: ARP, IP, ICMP; Link / Physical: Ethernet, IEEE (+ LLC + SNAP), PPP]

161 Routing: theoretical principles
- A network can be represented as a graph
  - Vertex = router
  - Edge = communication path between routers (point-to-point link, LAN...)
- Routing protocols are based on shortest-path algorithms
  - Dijkstra
  - Bellman-Ford
[Figure: example of a directed graph, with a number (= weight) associated with each edge]
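As a reminder of what a shortest-path computation on such a weighted directed graph looks like, here is a minimal sketch of Dijkstra's algorithm. The graph, its weights and the dictionary representation are hypothetical choices for the example.

```python
import heapq


def dijkstra(graph, source):
    """Shortest distances from source; graph[u] maps neighbour -> edge weight."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for neighbour, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbour, float("inf")):
                dist[neighbour] = candidate
                heapq.heappush(heap, (candidate, neighbour))
    return dist


# Hypothetical directed graph: vertices are routers, weights are link costs.
example = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 2, "D": 5},
    "C": {"D": 1},
    "D": {},
}
print(dijkstra(example, "A"))
```

Note that the function needs the whole graph as input, which is the "global knowledge" requirement that distinguishes Dijkstra from Bellman-Ford.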

162 Cost criteria for routing
- The "shortest" path = the one that minimizes a given cost function
  - Cost function of a path = arbitrary combination of:
    - Topological distance (= number of hops)
    - Link capacity
    - Link load
    - A monetary cost per link
    - Link reliability

Cost criteria for routing (continued)
- Minimum path distance
  - Minimize the number of hops (edges, networks) traversed
- Minimum path length
  - Minimize a more complex cost function associated with the path
  - Minimum distance = special case of minimum length (unit cost on every edge)

163 Dijkstra versus Bellman-Ford
- Dijkstra
  - At each vertex: global knowledge of the network is required (cost of every edge of the graph)
- Bellman-Ford
  - At each vertex: local knowledge (information from neighbours, cost of incident edges)
- If the graph is "static" (= topology and costs do not change), both algorithms converge to the same solution

IP-level processing in a BSD Unix machine
[Figure (Stevens, 1994): packets arrive from the network interfaces into the IP buffer. Test: "Is this packet for me?" (one of my IP addresses, or a broadcast address). If yes: delivery to TCP, UDP or ICMP. If no: forward it (if forwarding is enabled). IP output: determine the next router (if necessary) from the routing table. Also shown around the routing table: routing daemon, route command, netstat command, ICMP redirect]

164 Routing table
- Look up in the table an address that matches the destination address:
  - Host address (full match)
  - Network address (partial match)
  - Default router (no match)
- Filling the table
  - Manual (static)
  - Automatic (dynamic)

Hierarchical addressing (post-1994)
- Classless forwarding is more complex
  - Possible implementation in a "pre-CIDR" router: two-level lookup
    - Decide which class the address belongs to (= 4-bit mask)
    - Search the table for that class
  - Implementation in a "post-CIDR" router
    - No classes: a single table for all addresses (= 32-bit "mask")
    - Longest prefix match: the entry (= prefix) with the most bits in common with the address is chosen
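Longest prefix match can be sketched with a linear scan over the table. This is only an illustration: real routers use specialised data structures, and the table contents and next-hop names below are hypothetical.

```python
import ipaddress


def longest_prefix_match(table, destination):
    """Return the next hop of the matching entry with the longest prefix."""
    addr = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop in table:
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return None if best is None else best[1]


# Hypothetical table: default route, two network routes and a host route.
table = [
    ("0.0.0.0/0", "default-router"),
    ("10.0.0.0/8", "router-a"),
    ("10.1.0.0/16", "router-b"),
    ("10.1.2.3/32", "host-route"),
]

print(longest_prefix_match(table, "10.1.2.3"))    # full match
print(longest_prefix_match(table, "10.1.9.9"))    # partial match, /16 wins
print(longest_prefix_match(table, "192.0.2.1"))   # only the default matches
```

In this view the host route (/32), the network routes and the default route (/0) are all handled by the same rule.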

165 Dynamic (or adaptive) routing
- Network conditions change, so routes change
  - Dynamic routing = routing policy
    - Choice among multiple possible routes
    - Selection of a new route
- Requires routers to exchange information: a routing protocol
- Advantages
  - Reliability: robustness to failures
  - Can help reduce congestion

Dynamic (or adaptive) routing (continued)
- Problems
  - Increased complexity
    - Routing decision = computation cost
  - Routing information is collected in one place and used in another
    - More information exchanged = better decisions
    - More information exchanged = more traffic
  - Speed of adaptation
    - Too fast = route instability
    - Too slow = useless
  - Fluttering
  - Looping

166 Fluttering
- Very fast oscillations in routing
  - Successive packets take different routes
  - Possible cause: load splitting in routers (more than one route available + load shared across the routes)
  - Example (Paxson 1997)
[Figure. Drawing: Stallings, 2002]

Fluttering (continued)
- Problems
  - Estimating the available capacity?
  - Estimating round-trip times?
  - Out-of-order packet arrival + different propagation delays = unnecessary retransmissions (TCP)

167 Looping
- Routing loops
  - Possible cause: a change in connectivity + update time > 0 = convergence time > 0
  - Example (Paxson 1997):
    - 60 loops observed over 3 days of observation
    - Some very persistent (duration > 1/2 day)

Routing asymmetry
- The path from A to B does not necessarily go through the same routers as the path from B to A
- A problem for, e.g., RSVP
  - Resources are reserved by the return messages (RESV): what happens when routing is asymmetric?
  - A context must be created when PATH messages are received (to remember the return path)
[Figure: PATH messages between source, routers and receiver; RESV messages in the reverse direction]

168 Autonomous systems (AS)
[Figure. Drawing: Stallings, 2002]

AS and routing protocols
- Interior routing: an IGP (Interior Gateway Protocol)
  - Goal: deliver packets to their destination efficiently (connectivity matters most)
  - Each AS may use a different interior protocol
  - Discovery of the other routers
- Exterior routing: an EGP (Exterior Gateway Protocol)
  - Goal: the same as for an IGP + handling routing policies (arguably more important than connectivity!)
  - Common to different ASes
  - Topology is known; the routers "know each other"

169 Allocation of AS numbers
- The bodies that assign IP addresses also give an AS number to each site
  - Encoded on 16 bits
  - Used by the exterior routing protocol
  - Ranges of values for each RIR + numbers for private use

Some routing protocols
- Interior
  - Local view of the network (of the AS): RIP
    - Distance-vector protocol
  - Global view of the network (of the AS): OSPF, IS-IS
    - Link-state protocols
- Exterior
  - BGP
    - Distance-vector protocol
    - Implements routing policies

170 RIP
- Routing Information Protocol
  - RFC 1058 (version 1), RFC 2453 (version 2)
  - The two versions can interoperate
- Distance-vector protocol
- Routing algorithm: distributed Bellman-Ford
- Routing messages are exchanged over UDP
  - Well-known port for RIP: 520

171 Distance-vector routing: principle
- Neighbouring nodes = nodes connected to the same network
- Each router (or machine) must exchange information with all its neighbours
- Each link may have an associated cost
[Figure. Drawing: Stallings, 2002]

Distance-vector routing: variables
- Maintained by a router x directly attached to M networks (with N = total number of networks)
  - $W_x = [\, w(x,1), \dots, w(x,M) \,]$ = vector of link costs
  - $L_x = [\, L(x,1), \dots, L(x,N) \,]$ = vector of distances, where $L(x,i)$ = distance between x and network i
  - $R_x = [\, R(x,1), \dots, R(x,N) \,]$ = vector of neighbouring routers, where $R(x,i)$ = router through which network i is reached

172 Distance-vector routing: algorithm
- Periodically: exchange the distance vector $L_x$ with the neighbours
- Recompute the vector $L_x$ according to:

  $$L(x,j) = \min_{y \in A} \big( L(y,j) + w(x, N_{xy}) \big), \quad j = 1, \dots, N$$

  where:
  - $L(x,j)$ = minimum distance from x to j
  - $L(y,j)$ = minimum distance from y to j
  - $w(x, N_{xy})$ = cost of going from x to y
  - $A$ = set of routers neighbouring node x
  - $N_{xy}$ = network connecting x to router y

Distance-vector routing: algorithm (continued)
- Obtain the vector $R_x$ according to:

  $$R(x,j) = \arg\min_{y \in A} \big( L(y,j) + w(x, N_{xy}) \big), \quad j = 1, \dots, N$$

  that is, y = the router minimizing the distance from x to j
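The two formulas above (the min for L and the arg min for R) can be sketched as a single function computed at router x. The data layout and names are assumptions made for this example; directly connected networks are deliberately left out, so only routes learned from neighbours are considered.

```python
INF = float("inf")


def recompute(link_cost, neighbour_vectors, networks):
    """One Bellman-Ford step at router x.

    link_cost[y]            -- cost w(x, N_xy) of reaching neighbour y
    neighbour_vectors[y][j] -- distance L(y, j) advertised by neighbour y
    Returns (L_x, R_x) as two dictionaries indexed by network j.
    """
    distance, next_hop = {}, {}
    for j in networks:
        best, via = INF, None
        for y, vector in neighbour_vectors.items():
            candidate = vector.get(j, INF) + link_cost[y]
            if candidate < best:          # min gives L(x, j), arg min R(x, j)
                best, via = candidate, y
        distance[j], next_hop[j] = best, via
    return distance, next_hop


# Hypothetical router x with two neighbours and two remote networks.
costs = {"y1": 1, "y2": 2}
vectors = {"y1": {"n1": 3, "n2": 1}, "y2": {"n1": 0, "n2": 5}}
print(recompute(costs, vectors, ["n1", "n2"]))
```

Only the neighbours' vectors and the costs of the incident links are needed, which is the "local knowledge" property of Bellman-Ford.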

173 Distance-vector routing: metric
- Typically: link cost = 1, so the distance metric = number of hops (= number of networks traversed)

  $$w(x,i) = 1 \;\; \forall x, i \quad \Rightarrow \quad L(x,j) = \min_{y \in A} \big( L(y,j) + 1 \big), \quad j = 1, \dots, N$$

RIP: practical details
- Distributed operation among cooperating nodes
- Consequences:
  - Initialization and incremental update
  - Detection of topology changes
    - "Counting to infinity" problem
    - Solutions: split horizon / poisoned reverse

174 RIP: initialization
- Distance vector initialized according to:

  $$L(x,j) = \begin{cases} w(x,j) = 1 & \text{if x is directly connected to j} \\ \infty & \text{otherwise} \end{cases}$$

- Update: as described previously, except that...

RIP: incremental update
- The previous algorithm implicitly assumes:
  - That $L_x$ is updated in a time $\approx 0$, which requires receiving, in a time $\approx 0$, the distance vectors $L_y$ of all neighbours y
- In reality:
  - Asynchronous operation
  - Message losses (UDP)
  - In $L(x,j) = \min_{y \in A} \big( L(y,j) + 1 \big)$, $j = 1, \dots, N$, we do not necessarily have all the (recent) values of $L(y,j)$ for every $y \in A$

175 RIP: incremental update (continued)
- Update on reception of any distance vector
- Rules:
  - A received vector includes a new network: add the information to the routing table
  - A received vector advertises a smaller metric to reach a network: replace the existing route
  - A vector is received from router R and there are one or more routes whose next hop is R: update all these routes

RIP: topology change
- Detected thanks to the periodic exchange of updates (every 30 s)
- Router K must go through router J to reach network i
  - K receives no update from J for 180 s (= 6 missed updates)
    - Crash / failure of J?
    - Congestion? (transport over UDP)
  - K marks the route as invalid
  - K replaces the old route to i with a new one when it learns that a neighbour has a valid route
- In practice: infinite distance (= unreachable router) is encoded by the value 16
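The three update rules above can be sketched as a function applied each time a distance vector arrives. This is a simplified illustration: the table layout and the name `process_update` are assumptions, and timers (30 s updates, 180 s timeout) are not modelled.

```python
INFINITY = 16  # RIP metric meaning "unreachable"


def process_update(table, sender, vector):
    """Apply one received distance vector to the routing table.

    table  -- network -> (metric, next_hop)
    vector -- network -> metric advertised by `sender`
    """
    for network, advertised in vector.items():
        metric = min(advertised + 1, INFINITY)      # one more hop via sender
        current = table.get(network)
        if current is None:
            if metric < INFINITY:
                table[network] = (metric, sender)   # rule 1: new network
        elif metric < current[0]:
            table[network] = (metric, sender)       # rule 2: smaller metric
        elif current[1] == sender:
            table[network] = (metric, sender)       # rule 3: next hop changed its mind
    return table


# Hypothetical table at some router, then an update received from router "A".
routes = {"n1": (2, "A"), "n2": (3, "B")}
print(process_update(routes, "A", {"n1": 4, "n2": 1, "n3": 5}))
```

Rule 3 is the one that lets bad news propagate: the route to n1 gets worse (metric 5) simply because the current next hop says so.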

176 RIP: topology change (continued)
- Very slow convergence when the topology changes (counting to infinity): up to 16 minutes
- Initially: L(B,5) = 2
- Then D fails:
  1. B detects that it can no longer reach 5 through D
  2. B receives an advertisement, e.g. from A, with L(A,5) = 3; B updates L(B,5) = 3 + 1 = 4 and advertises this distance to A and C
  3. A and C receive this advertisement: L(A,5) = L(C,5) = 4 + 1 = 5
  4. B receives these advertisements: L(B,5) = 5 + 1 = 6
  5. etc.
[Figure. Drawing: Stallings, 2002]

RIP: topology change (continued)
- Counting-to-infinity problem: each router thinks it must go through the other to reach a given network
- To avoid it: split horizon / poisoned reverse
  - Poisoned reverse: send an update with metric = 16 to the neighbour from which the route was learned

177 RIP: topology change (continued)
- Split horizon: do not send an update for a route to the neighbour through which that route passes (since that neighbour must be closer to the destination)
  - A does not send the distance L(A,5) to B
  - C does not send the distance L(C,5) to B
[Figure. Drawing: Stallings, 2002]

Format of RIP packets (version 1)
- Header: command, version (1), 0
- Route entry (20 bytes): address family (2 = IP), 0, IP address, 0, 0, metric (from 1 to 16)
- Up to 24 more routes, in the same format as above
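Split horizon and poisoned reverse both come down to how a router builds the update it sends to one particular neighbour. The sketch below assumes a table of the form network -> (metric, next hop); the function name and the sample table are hypothetical.

```python
INFINITY = 16  # RIP metric meaning "unreachable"


def build_update(table, neighbour, poisoned_reverse=False):
    """Distance vector to advertise to `neighbour`.

    table -- network -> (metric, next_hop); next_hop is None when the
             network is directly connected.
    """
    update = {}
    for network, (metric, next_hop) in table.items():
        if next_hop == neighbour:
            if poisoned_reverse:
                update[network] = INFINITY   # advertise it, but as unreachable
            # plain split horizon: say nothing about this route
        else:
            update[network] = metric
    return update


# Hypothetical table at router A: network 5 is reached through B.
table_a = {"net5": (3, "B"), "net1": (1, None)}
print(build_update(table_a, "B"))                         # split horizon
print(build_update(table_a, "B", poisoned_reverse=True))  # poisoned reverse
print(build_update(table_a, "C"))                         # other neighbours get everything
```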

178 Format of RIP packets (continued)
- Command: 1 = request, 2 = response
  - Request at start-up: asks for the complete routing table (command = 1, address family = 0, metric = 16)
  - Response:
    - To a request
    - Periodic transmission of the routing tables (modified or not)
    - Transmission of the routing tables if modified
      - Random wait (1 to 5 s) before sending

Format of RIP packets (continued)
- IP address indicating:
  - A network
  - A subnet
  - A host
  - A default route (= 0.0.0.0)
- On a broadcast network (e.g. Ethernet), messages are sent by broadcast
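The RIP version 1 layout (a 4-byte header followed by 20-byte route entries) can be sketched with the standard `struct` module. The function name and the sample routes are hypothetical; the field order follows the version 1 format given in these slides.

```python
import socket
import struct

HEADER = struct.Struct("!BBH")     # command, version, must-be-zero
ENTRY = struct.Struct("!HH4sIII")  # family, zero, IP address, zero, zero, metric


def build_rip1_response(routes):
    """Build a RIP-1 response from (dotted IP address, metric) pairs."""
    if len(routes) > 25:
        raise ValueError("a RIP packet carries at most 25 routes")
    packet = HEADER.pack(2, 1, 0)  # command 2 = response, version 1
    for address, metric in routes:
        packet += ENTRY.pack(2, 0, socket.inet_aton(address), 0, 0, metric)
    return packet


# Hypothetical routes: one network at distance 1, one unreachable (16).
message = build_rip1_response([("10.0.0.0", 1), ("192.168.1.0", 16)])
print(len(message), message.hex())
```

The `!` prefix selects network byte order without padding, so each entry is exactly 20 bytes, as on the slide.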

179 Format of RIP-2 packets
- Header: command, version (2), routing domain
- Route entry (20 bytes): address family (2 = IP), route tag, IP address, subnet mask, IP address of the next router, metric (from 1 to 16)
- Up to 24 more routes, in the same format as above

Format of RIP-2 packets (continued)
- Routing domain: allows several routing policies to be defined within the same local network
  - Routers ignore messages with a different domain number
[Figure. Drawing: Toutain, 2003]

180 Format of RIP-2 packets (continued)
- Route tag: support for exterior routing protocols (AS number)
- Address of the next router: destination to insert into the routing table
  - If = 0, the route goes through the sender of the message
- On a broadcast network (e.g. Ethernet), messages are sent by multicast
  - Address = 224.0.0.9
  - A RIP-2 router receives RIP-1 packets (sent by broadcast), but not the other way round

RIP: limitations and problems
- Restricted to small networks
  - In practice: $L(x,j) < 16 \;\; \forall x, j$
  - Complete routing tables are transmitted
- Overly simplistic metric = suboptimal routing
  - More sophisticated constraints? (delay, bandwidth...)
- RIP-1: a router may accept updates coming from any source (no authentication)
  - Solutions in RIP-2: clear-text password or shared secret


182 DNS (Domain Name System)
- How do we identify a machine?
  - IP address
    - Well suited to automated processing (routers, machines, etc.)
    - Hard for humans to remember
  - Name
    - Easy to remember
- How do we map between address and name?
  - (Historical approach) A file containing a mapping table, stored on each machine
  - What problem(s) do you see with this approach?

183 DNS (Domain Name System)
- Solution: use a distributed directory service
  - DNS: Internet standard (RFC 1034, 1035)
  - The DNS is:
    - A distributed database (directory)
    - A protocol to manage and access this database
- Principles of the DNS
  - Name space
    - Tree structure
  - Client-server architecture
    - Hierarchy of servers
    - Caching of address <=> name mappings

Services offered by the DNS
- Address <=> name mapping
  - In both directions
- Aliases
  - A machine can have several names, one of which is "canonical"

  % dig vsp-webcache-11.enst-bretagne.fr
  vsp-webcache-11.enst-bretagne.fr IN A
  % dig -x
  in-addr.arpa IN PTR vsp-webcache-11.enst-bretagne.fr.
  % dig mail.google.com
  mail.google.com. IN CNAME googl .l.google.com.
  googl .l.google.com. 142 IN A

184 Services offered by the DNS (continued)
- Aliases
  - Alias for mail servers

  % dig gmail.com
  gmail.com. 40 IN A
  % dig gmail.com MX
  gmail.com. 443 IN MX 10 alt1.gmail-smtp-in.l.google.com.
  gmail.com. 443 IN MX 5 gmail-smtp-in.l.google.com.
  gmail.com. 443 IN MX 40 alt4.gmail-smtp-in.l.google.com.
  gmail.com. 443 IN MX 20 alt2.gmail-smtp-in.l.google.com.
  gmail.com. 443 IN MX 30 alt3.gmail-smtp-in.l.google.com.
  alt1.gmail-smtp-in.l.google.com. 142 IN A
  alt3.gmail-smtp-in.l.google.com. 150 IN A
  gmail-smtp-in.l.google.com. 96 IN A
  alt2.gmail-smtp-in.l.google.com. 61 IN A
  alt4.gmail-smtp-in.l.google.com. 188 IN A

Services offered by the DNS (continued)
- Load balancing between application servers
  - One canonical name => a set of IP addresses
  - Successive queries return this set, but reordered
[dig output: two successive queries for the same name, each returning six A records]

185 Name space
[Figure: the DNS tree. Root (name: empty string). First-level domains (top-level domains, TLD), generic (gTLD) and country (ccTLD): arpa, org, edu, mil, int, com, gov, net, eu, uk, fr. Lower levels shown: in-addr, ibm, cnn, ac, zurich, ucl, cs, www. The in-addr.arpa. subtree is used in a PTR query to find the name associated with an address]

DNS: client-server architecture
- How the DNS works, "seen from above"
  1. The user types the URL into the browser
  2. The browser passes the URL to the DNS client (resolver)
     - Call to the gethostbyname() function or similar
  3. The DNS client sends a query for the name to a DNS server
  4. The client receives a response from the server, with the IP address of the host
  5. The client returns the IP address to the browser
  6. The browser opens a TCP connection to that address
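The name looked up by a PTR query under in-addr.arpa is the address with its bytes reversed. A minimal sketch, using the standard `ipaddress` module and a hypothetical documentation address:

```python
import ipaddress


def ptr_name(address):
    """Name to query (record type PTR) to find the host name of an address."""
    return ipaddress.ip_address(address).reverse_pointer


# Hypothetical address from a documentation range.
print(ptr_name("192.0.2.10"))
```

The bytes are reversed so that the most significant part of the address ends up closest to the root of the tree, like the TLD in an ordinary name.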

186 DNS: client-server architecture (continued)
- Types and hierarchy of servers
  - Root
  - TLD (top-level domain)
  - Authoritative
  - Local
- Types of queries
  - (1): recursive
    - Delegates the search to the server that receives it
  - (2), (4), (6): iterative
    - The response says whom to contact next
    - "I don't know this name, but you can ask this server"
[Figure: Kurose & Ross, 2003]

Root servers
- 13 servers, named "A-root" to "M-root"
- How many different IP addresses for the root servers?

