TCP/IP Architecture, Design, and Implementation in Linux
Authors: Sameer Seth, M. Ajaykumar Venkatesulu
ASIN: 0470147733
Publisher: John Wiley & Sons
Edition: N/A
Publication Date: 2008-12-10
Language: English
Print Length: 800 pages
ISBN-10: 0470147733
ISBN-13: 978-0-470-14773-3
Book Description
From the Inside Flap
As open source software becomes a trusted part of business and research systems, it’s no wonder that a combination of the Transmission Control Protocol/Internet Protocol (TCP/IP) and the Linux operating system is becoming more common. TCP/IP’s prevalence allows easy communication among computers using various operating systems, whether Windows, Mac OS, Linux, or Unix. And Linux—because it is open source and thus modifiable—has become a frequent choice for developers who want a customizable operating system on which to build their applications.
This book describes the design and implementation of TCP/IP in Linux, from simple client-server applications to more complex executions. Topical coverage includes:
- Basic socket concepts and implementations
- The Linux implementation of network packets
- TCP read/write
- TCP algorithms for data transmission and congestion control
- TCP timers
- IP layer and routing tables implementation
- IP forwarding and quality of service implementation
- Netfilter hooks for the stacks
- Network Soft IRQ
- How to debug a TCP/IP stack
All topics are discussed in a concise, step-by-step manner and the book is complemented with helpful illustrations to give readers a better understanding of the subject. TCP/IP Architecture, Design, and Implementation in Linux is an indispensable resource for embedded-network product developers, network security product developers, IT network architects, researchers, and graduate students.
About the Author
M. Ajaykumar Venkatesulu is currently working on networking and naming services. He has seven years of experience with Linux networking and kernel in research and commercial environments. His areas of interest include Linux kernel, embedded systems, IP routing, and IP QoS.
Excerpt. © Reprinted by permission. All rights reserved.
TCP/IP Architecture, Design and Implementation in Linux
By S. Seth and M. A. Venkatesulu
John Wiley & Sons
Copyright © 2008 IEEE Computer Society
All rights reserved.
ISBN: 978-0-470-14773-3
Chapter One
INTRODUCTION
Internetworking with Linux has been the most popular choice of developers. Not only in the server world, where Linux has made its mark, but also in the market for small embedded network operating systems, Linux is the most popular choice. All of this requires an understanding of the TCP/IP code base. Some products require an implementation of a firewall, and others require an implementation of IPSec. There are products that require modifications to the TCP connection code for load balancing in a clustered environment, and some products require improved scalability on SMP machines. Most talked about is the embedded world, where networking is pervasive. Real-time embedded products have very specific requirements and need substantial modifications to the stack, whether for buffer management or for performance reasons. All of this requires a complete understanding of the stack implementation and its supporting framework.
As mentioned above, some embedded networking products require that only a minimum of the code be compiled, because of memory constraints. This requirement calls for knowledge of how the source code is organized in the Linux source distribution. Once we know how the code is laid out, it becomes easier to find the relevant code we are interested in.
Almost all networking applications work on very basic client-server technology: the server listens on a well-known port for connection requests, while the client sends a connection request to the server. Many complex arrangements are layered on top of this client-server technology, for security reasons or sometimes for load balancing, but the basic implementation remains a simple client-server program in which the client and server talk to each other. For example, telnet and ftp services are accessed through the inetd superserver, which hides all the details of the individual services. There are also many tunable parameters available for your TCP/IP connections; these can be used to best tune a connection without disturbing overall system-wide tuning.
Most network applications are written to exchange data. Once a connection is established, either the client sends data to the server, data flow in the opposite direction, or data flow in both directions. There are different ways to send and receive data over a connection; these techniques differ mainly in whether the application blocks on the socket when it receives or sends data.
In this book we discuss only TCP and no other transport protocol, so we need to understand the TCP connection process. TCP is a connection-oriented protocol with a set process for initiating a connection and, similarly, a set process for closing a connection cleanly. TCP maintains state for the connection because of the handshakes performed during connection initiation and closure. We need to understand the TCP states to completely understand the TCP connection process.
In this chapter we present an overview of how the TCP/IP protocol stack is implemented on Linux. We need to understand the Linux operating system, including processes, threads, system calls, and kernel synchronization mechanisms; all of these topics are covered, though not in great detail. We also need to understand the application programming interface that uses the TCP/IP protocol stack for data transmission, and we discuss socket options along with their kernel implementation. Finally, we discuss TCP states, covering the three-way handshake for opening a connection and the four-way handshake for connection closure.
1.1 OVERVIEW OF TCP/IP STACK
Let’s see how the TCP/IP stack is implemented on Linux. First we need to understand the network buffer that represents a packet on Linux: sk_buff (see Fig. 1.1). sk_buff carries all the required information related to the packet, along with a pointer to the route for the packet. The head, data, tail, and end fields point to the start of the data block, the actual start of data, the end of data, and the end of the data block, respectively. An skb_shared_info object is attached at the end of the sk_buff header and keeps additional information about the paged data area. The actual packet is contained in the data block and is manipulated through the data and tail pointers. This buffer is used everywhere in the networking code as well as in network drivers. Details are discussed in Chapter 5.
Now we will have a look at how the stack is implemented in Linux. We will first start with down-the-stack processing of the packet from the socket layer to the driver layer and then move up the stack. We will take an example of sending TCP data down the stack. In general, more or less the same stack is used for other transport protocols also, but we will restrict our discussion to TCP only.
1.1.1 Moving Down the Stack
When an application wants to write data over a TCP socket, the kernel reaches the socket through the VFS (see Fig. 1.2). The inode for a file of type socket contains a socket object, which is the starting point for the networking stack (see Section 3.2 for more details). The socket object has a pointer, in its ops field, to a set of operations specific to the socket type; this proto_ops object holds the socket-specific operations. In our case the socket is of type INET, so the send system call ends up calling inet_sendmsg inside the kernel via the VFS. The next step is to call a protocol-specific send routine, because there may be different protocols registered under the INET socket type (see Section 3.1). In our case the transport layer is TCP, so inet_sendmsg calls a protocol-specific send operation. The protocol-specific socket is represented by a sock object pointed to by the sk field of the socket object, and the protocol-specific set of operations is maintained in a proto object pointed to by the prot field of the sock object. inet_sendmsg thus calls the protocol-specific send routine, which is tcp_sendmsg.
In tcp_sendmsg, user data are given to the TCP segmentation unit. The segmentation unit breaks big chunks of user data into small blocks and copies each small block into an sk_buff. These sk_buffs are copied to the socket's send buffer, and then the TCP state machine is consulted to transmit data from the socket send buffer. If the TCP state machine does not allow sending new data for any reason, we return. In such a case, the data will be transmitted later by the TCP state machine on some event, which is discussed in Section 11.3.11.
If the TCP state machine is able to transmit the sk_buff, it sends the segment to the IP layer for further processing. In the case of TCP, sk->tp->af_specific->queue_xmit is called, which points to ip_queue_xmit. This routine builds an IP header and takes the IP datagram through the firewall policy. If the policy allows, the IP layer checks whether NAT/masquerading needs to be applied to the outgoing packet. If so, the packet is processed and is finally given to the device for transmission by a call to dev_queue_xmit. The device here refers to a network interface, which is represented by a net_device object. At this point the Linux stack implements QoS; queuing disciplines are implemented at the device level.
The packet (sk_buff) is queued to the device according to its priority level and the queuing discipline. Next, the packet is dequeued from the device queue, which is done just after queuing the sk_buff. The queued packet may be transmitted here, depending on the bandwidth available for the packet's priority. If so, the link layer header is prepended to the packet, and the device-specific hard transmit routine is called to transmit the frame. If we are unable to transmit the frame, the packet is requeued on the device queue and Tx softIRQ is raised on the CPU, adding the device to the CPU's transmit queue. Later, when the Tx softIRQ is processed, frames are dequeued from the device queue and transmitted.
1.1.2 Moving Up the Stack
Refer to Fig. 1.3 for the flow of a packet up the stack. We start with the reception of a packet at the network interface. An interrupt is generated once the packet has been completely DMAed into the driver's Rx ring buffer (for details see Section 18.5). In the interrupt handler, we just remove the frame from the ring buffer and queue it on the CPU's input queue; by the CPU here we mean the CPU that was interrupted. It is clear at this point that there is a per-CPU input queue. Once the packet is queued on the CPU's input queue, the Rx NET softIRQ is raised for that CPU by a call to netif_rx. Once again, softIRQs are raised and processed per CPU.
Later, when the Rx softIRQ is processed, packets are dequeued from the CPU's receive queue and processed one by one. Each packet is processed completely up to its destination here, which means that a TCP data packet is processed until the TCP data segment is queued on the socket's receive queue. Let's see how this processing is done at the various protocol layers.
netif_receive_skb is called to process each packet in the Rx softIRQ. The first step is to determine the protocol family to which the packet belongs; this is also known as packet protocol switching. The packet is also delivered to any raw socket opened for the device. Once the protocol family is identified, which in our case is IP, the protocol handler routine is called; for IP, this is the ip_rcv routine. ip_rcv tries to de-NAT or de-masquerade the packet at this point, if required. Then routing decisions are made on the packet. If it needs to be delivered locally, the packet is passed through the firewall policies configured for locally acceptable IP packets. If everything is OK, ip_local_deliver_finish is called to find the next protocol layer for the packet.
ip_local_deliver_finish implements the INET protocol switching code. Once we identify the INET protocol, its handler is called to further process the IP datagram. The IP datagram may belong to ICMP, UDP, or TCP.
Since our discussion is limited to TCP, the protocol handler is tcp_v4_rcv. The very first job of the TCP handler is to find the socket for the TCP packet. This may be a new open request for a listening socket, or another packet for an established socket, so various hash tables are looked into here. If the packet belongs to an established socket, the TCP engine processes the TCP segment. If the TCP segment contains in-sequence data, it is queued on the socket's receive queue. If there are any data to be sent, they are sent along with the ACK for the data that arrived here. Finally, when the application issues a read over the TCP socket, the kernel processes the request by providing data from the socket's receive queue.
The Linux stack maps to the OSI networking model (see Fig. 1.4).
1.2 SOURCE CODE ORGANIZATION FOR LINUX 2.4.20
Figure 1.5 shows the Kernel source tree.
1.2.1 Source Code Organization for Networking Code
Figure 1.6 shows the kernel networking source tree.
1.3 TCP/IP STACK AND KERNEL CONTROL PATHS
In this section we will see how TCP data are being processed by the Linux kernel. In totality, we will see different kernel control paths and processor context that are involved in packet processing through the kernel. When the process writes data over the TCP socket, it issues write/send system calls (see Fig. 1.7). The system call takes the process from the user land to the kernel, and now the kernel executes on behalf of the process as shown by the solid gray line. Let’s determine the different points in the kernel where the kernel thread sending TCP data on behalf of the process preempts itself.
Kernel Control Path 1. In this kernel control path, the kernel thread processes TCP data through the complete TCP/IP stack and returns only after transmitting data from the physical interface.
Kernel Control Path 2. This kernel control path processes data through TCP/IP stack but fails to transmit data because the device lock could not be obtained. In this case, the kernel thread returns after raising Tx softIRQ. SoftIRQ processing is deferred to some later point of time which will transmit data queued up on the device. See Section 17.1 for details on softIRQ processing.
Kernel Control Path 3. This kernel control path processes data through the TCP layer but is not able to take it further because the QoS policy does not allow further transmission of data. It may happen that either someone else is processing the queue on which the packet is queued or the quota for the queue is over. In the latter case, a timer is installed which will process the queue later.
Kernel Control Path 4. This kernel control path processes data through the TCP layer but cannot proceed any further and returns from here. The reason may be that the TCP state machine or congestion algorithm does not allow further transmission of data. These data will be processed later by the TCP state machine on generation of some TCP event.
Kernel Control Path 5. This kernel control path may execute in interrupt context or kernel context. The kernel context may come from the softIRQ daemon, which runs as a kernel thread and has no user context, or from the kernel thread corresponding to a user process that enables softIRQs on the CPU by a call to spin_unlock_bh. See Section 17.6 for more detail. This kernel control path processes all the data queued by control path 2.
Kernel Control Path 6. This kernel control path executes as a high-priority tasklet that is part of softIRQ. This may also be executed in interrupt context or kernel context as discussed above. This processes data queued by control path 3.
Kernel Control Path 7. This kernel control path executes as softIRQ when an incoming TCP packet is being processed. When a packet is received, it is processed by the Rx softIRQ. When a TCP packet is processed in the softIRQ, it may generate an event causing transmission of pending data in the send queue. This kernel control path transmits data that were queued by control path 4.
On the reception side, the packet is processed in two steps (see Fig. 1.8). An interrupt handler plucks the received packet from the DMA ring buffer, queues it on the CPU-specific input queue, and raises the Rx softIRQ. The Rx softIRQ is processed at some later point in time, in interrupt context or by the softIRQ daemon. The TCP data packet is processed completely by the Rx softIRQ until it is queued on the socket's receive queue or is eaten up by the application. A TCP ACK packet is processed by the TCP state machine, and the softIRQ returns only after action has been taken on the events generated by the incoming ACK.
1.4 LINUX KERNEL UNTIL VERSION 2.4 IS NON-PREEMPTIBLE
Let’s define the term preemption first, and then we will move ahead to its effect on the Linux kernel. Preemption in general means that the current execution context can be forced to give away the CPU to some other execution context under certain conditions. One might say that there is nothing special about this, since it happens on any multitasking OS: many user-land processes run on the CPU one at a time, each process is assigned a quota and continues to occupy the CPU until it has exhausted that quota, and once the quota for the currently running process is over, the kernel scheduler replaces it on the CPU with some other runnable process, even if the former was still executing. So we can say that the process was preempted. Very true: the user-land process is preempted to fairly give other processes a chance to run on the CPU. (We are not discussing scheduling with respect to real-time processes, only normal-priority processes scheduled under a round-robin scheduling policy.) In this way the kernel preempts the user-land process.
(Continues…)
Excerpted from TCP/IP Architecture, Design and Implementation in Linux by S. Seth and M. A. Venkatesulu. Copyright © 2008 by the IEEE Computer Society. Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.