TCP call flow in Linux Kernel

I am trying to trace the TCP call flow inside the Linux kernel (version 3.8) for the different user-space APIs such as connect, bind, listen and accept. Can anyone provide me with a flowchart of these call flows? I was able to find one for the data flow using the send and recv APIs.
Another question: when a client connects to a server, the server creates a new socket for that client, dedicated to that specific connection and returned by the accept API. My question is: does the Linux kernel maintain any relation between the listening socket and the sockets derived from it, in some hash/bind table, or not?

1st question:
http://www.danzig.jct.ac.il/tcp-ip-lab/ibm-tutorial/3376c210.html
All the lectures at Haifux are classic:
http://www.haifux.org/lectures/172/netLec.pdf
http://www.haifux.org/lectures/217/netLec5.pdf
And this is from the original author/maintainer of Linux networking himself:
http://vger.kernel.org/~davem/skb.html
http://vger.kernel.org/~davem/tcp_output.html
http://vger.kernel.org/~davem/tcp_skbcb.html
2nd question: Yes, all existing connections are maintained in a critical table: tcp_hashinfo. Its memory address can be read from /proc/kallsyms. "Critical" because reading from it requires locking, so don't try walking the table yourself even though you have the address. Use globally exported symbols like "inet_lookup_listener" or "inet_lookup_established" to look up entries in the table instead.
More info here:
How to identify a specific socket between User Space and Kernel Space?
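To give a feel for those helpers, below is a minimal, untested sketch of a module performing such a lookup. It assumes the ~3.8 prototypes from <net/inet_hashtables.h> (they change between kernel versions); the addresses and ports are hypothetical.

/*
 * Sketch: look up an established TCP socket from a kernel module using
 * the exported helpers, instead of walking tcp_hashinfo by hand.
 */
#include <linux/module.h>
#include <linux/inet.h>
#include <net/net_namespace.h>
#include <net/inet_hashtables.h>
#include <net/tcp.h>                /* tcp_hashinfo */

static int __init lookup_demo_init(void)
{
    struct sock *sk;

    /* saddr/sport = remote end, daddr/dport = local end (network order) */
    sk = inet_lookup_established(&init_net, &tcp_hashinfo,
                                 in_aton("192.0.2.1"), htons(12345),
                                 in_aton("192.0.2.2"), htons(80), 0);
    if (sk) {
        pr_info("found established socket, state %d\n", sk->sk_state);
        sock_put(sk);               /* the lookup took a reference */
    }
    return 0;
}

static void __exit lookup_demo_exit(void) { }

module_init(lookup_demo_init);
module_exit(lookup_demo_exit);
MODULE_LICENSE("GPL");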

Flowcharts? Flow diagrams? Not a chance. We would love to have them, but they do not exist; you can, however, review the code, and patches are happily reviewed.
A socket returns a file descriptor; the process file descriptor table maintains the association between the socket and the other kernel data structures. The file descriptor makes this a simple array-indexing operation; no hashing is needed.
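To make the listener/connection distinction concrete, here is a minimal user-space sketch (error handling omitted, port hypothetical); the descriptor returned by accept() is a brand-new entry in the process file descriptor table, distinct from the listening descriptor:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    struct sockaddr_in addr;
    int lfd = socket(AF_INET, SOCK_STREAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(5555);        /* hypothetical port */

    bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
    listen(lfd, 16);

    /* accept() returns a NEW descriptor for this one connection; the
     * listening descriptor stays open and keeps accepting others. */
    int cfd = accept(lfd, NULL, NULL);  /* blocks until a client connects */
    printf("listener fd=%d, connection fd=%d\n", lfd, cfd);

    close(cfd);
    close(lfd);
    return 0;
}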

Related

How to identify and send messages to application from Kernel?

I'm writing a kernel module that sends and receives internet packets and I'm using Generic Netlink to communicate between Kernel and Userspace.
When the application wants to send an internet message (doesn't matter what the protocol is), I can send it to the Kernel with no problems via one of the functions I defined in my generic netlink family and the module sends it through the wire. All is fine.
But when the module receives a packet, how can I reach the appropriate process to deliver the message? My trouble is not in identifying the correct process (that is done via custom protocols, e.g. IP tables); it is in what information I should store in order to notify the correct process.
So far I keep only the portid of the process (because it initiates the communication), and I have been trying to use the function genlmsg_unicast(), but a 2009 kernel change altered it to require an additional parameter (besides the skb *buffer and the portid): a pointer to a struct net. None of the tutorials I have found addresses this.
I tried using &init_net as the new parameter, but the computer just freezes and I have to restart it through the power button.
Any help is appreciated.
Discovered what was causing the issue:
It turned out that I was freeing the buffer at the end of the function. #facepalm
I shouldn't be doing so, because the buffer gets queued and waits there until it is actually delivered. So it is not the caller's responsibility to free the buffer if the function genlmsg_unicast() succeeds.
Now it works with &init_net.
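For later readers, here is a minimal sketch of that working path; the command and attribute names (MY_CMD_DELIVER, MY_ATTR_DATA) are hypothetical, and saved_portid is the portid stored when the application first contacted the module:

#include <net/genetlink.h>
#include <net/net_namespace.h>

static int deliver_to_userspace(struct genl_family *family, u32 saved_portid,
                                const void *data, int len)
{
    struct sk_buff *skb;
    void *hdr;

    skb = genlmsg_new(NLMSG_GOODSIZE, GFP_KERNEL);
    if (!skb)
        return -ENOMEM;

    hdr = genlmsg_put(skb, 0, 0, family, 0, MY_CMD_DELIVER);
    if (!hdr) {
        nlmsg_free(skb);
        return -EMSGSIZE;
    }
    if (nla_put(skb, MY_ATTR_DATA, len, data)) {
        genlmsg_cancel(skb, hdr);
        nlmsg_free(skb);
        return -EMSGSIZE;
    }
    genlmsg_end(skb, hdr);

    /* genlmsg_unicast() always takes ownership of the skb: it is queued
     * for delivery or freed internally. Do NOT free it after this call;
     * free it only on the failure paths above. */
    return genlmsg_unicast(&init_net, skb, saved_portid);
}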

How to get caller pid in zmq (local socket)

I'm new to ZeroMQ. I'm using it for local IPC on a Linux-based OS (the socket is of AF_UNIX type).
But I could not find a way to get the caller's (client's) process ID. Is there any way to find it using ZeroMQ? (Finding the PID of the caller is a must for my access-control requirement, and if ZeroMQ does not provide it, I will have to switch to D-Bus.)
Please help me.
Forget most of the low-level socket designs and worries. Think higher in the sky. ZeroMQ is a much higher-level messaging concept, so you will have zero worries about most of the socket-I/O problems.
For more on these ZeroMQ principles, read Pieter Hintjens' design maxims and his resource-rich book "Code Connected, Vol. 1".
That said, the solution is fully in your control.
Solution
Create a problem-specific multi-zmq-socket / multi-zmq-pattern scheme (multiple ZeroMQ primitives used and orchestrated by your application-level logic) as a problem-specific formal communication handshake.
Ensure the <sender> adds its own PID into the message (see the sketch after this list).
Re/authorise via another register/auth socket pattern, with the sender pre-registered from the receiver side, so as to avoid a spoofing attack under a fake/stolen PID identity.
Adapt your access-control policy to your problem domain; use and implement any level of formal crypto-security handshaking protocols for identity validation or key exchange, to raise your access-control security to adequate strength (including MIL-STD grades).
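A minimal sketch of step 2, the sender adding its PID as the first message frame (the endpoint name is hypothetical, and remember a self-reported PID is spoofable, hence the registration/authentication steps above):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <zmq.h>

int main(void)
{
    void *ctx = zmq_ctx_new();
    void *req = zmq_socket(ctx, ZMQ_REQ);
    zmq_connect(req, "ipc:///tmp/acl.sock");        /* hypothetical endpoint */

    char pid[16];
    snprintf(pid, sizeof pid, "%ld", (long)getpid());

    zmq_send(req, pid, strlen(pid), ZMQ_SNDMORE);   /* frame 1: PID     */
    zmq_send(req, "hello", 5, 0);                   /* frame 2: payload */

    char reply[64];
    int n = zmq_recv(req, reply, sizeof reply - 1, 0);
    if (n >= 0) {
        reply[n] = '\0';
        printf("server said: %s\n", reply);
    }

    zmq_close(req);
    zmq_ctx_term(ctx);
    return 0;
}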

How does ancillary data in sendmsg() work?

sendmsg() allows sending ancillary data to another socket, and I am wondering how this works.
1) Is the ancillary data packed along with the normal message?
2) If so, how would a remote receiving socket know how to parse this?
3) How would a remote receiving client retrieve this ancillary data?
Thanks.
Ancillary data is never sent on the wire. For Unix domain sockets, ancillary data is used to send or receive file descriptors between processes, e.g. to share work or load-balance tasks. Note: Unix domain sockets transfer information between processes running on the same machine, not between processes running on different machines.
Again, in the case of processes running on different machines: the packet sent without using any ancillary mechanism is exactly the same as the packet sent when the ancillary mechanism is in use on the sending (or receiving) machine. Hence, ancillary data is not something shipped with your packet.
Ancillary data is used to receive EXTRA packet-related services/information from the kernel in a user-space application, information which is not available otherwise. For example, say machine B receives a packet on the wire and you want to know the ingress interface the packet arrived on. How would you know this? Ancillary data comes to the rescue.
Ancillary data is a set of flags placed in the ancillary control buffer and passed to the kernel when sendmsg()/recvmsg() is called, telling the kernel what extra services/information should be provided to the application when a packet is sent or arrives.
Ancillary data is a means of communication between the kernel and a user-space application, or between processes on the same machine in the case of Unix domain sockets. It is not something the packet on the wire carries.
For your reference, you can download a code example here, which runs perfectly on my Ubuntu machine. The ancillary-data concept is demonstrated in src/igmp_pkt_reciever.c.
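To make the ingress-interface example concrete, here is a minimal user-space sketch (independent of the linked code) that enables IP_PKTINFO on a UDP socket and reads the interface index the kernel attaches as ancillary data:

#define _GNU_SOURCE              /* for struct in_pktinfo on glibc */
#include <stdio.h>
#include <netinet/in.h>
#include <sys/socket.h>

static void recv_with_ifindex(int udp_fd)
{
    char data[1500], cbuf[CMSG_SPACE(sizeof(struct in_pktinfo))];
    struct iovec iov = { .iov_base = data, .iov_len = sizeof(data) };
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = cbuf, .msg_controllen = sizeof(cbuf),
    };
    int on = 1;

    /* ask the kernel to attach packet info to every datagram */
    setsockopt(udp_fd, IPPROTO_IP, IP_PKTINFO, &on, sizeof(on));

    if (recvmsg(udp_fd, &msg, 0) < 0)
        return;

    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c))
        if (c->cmsg_level == IPPROTO_IP && c->cmsg_type == IP_PKTINFO) {
            struct in_pktinfo *pi = (struct in_pktinfo *)CMSG_DATA(c);
            printf("arrived on interface index %d\n", pi->ipi_ifindex);
        }
}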
You can only use ancillary data in a few select ways:
You can use it to get the receiving interface (IPv4)
You can use it to specify the hop limit (for IPv6)
You can use it to specify traffic class (again, IPv6)
....
You can use it to pass/receive file descriptors or user credentials (Unix domain)
The first three cases are merely artificial API methods of receiving control information from kernel land via recvmsg(2). The last one is the most interesting: the only case where ancillary data is actually transferred is with Unix domain sockets, where everything happens inside the kernel, so nothing ever gets on the wire.
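For the Unix-domain case, here is a minimal sketch of passing a file descriptor with SCM_RIGHTS; the helper name send_fd is hypothetical and error handling is trimmed:

#include <string.h>
#include <sys/socket.h>

/* sock_fd is assumed to be a connected AF_UNIX SOCK_STREAM socket */
static int send_fd(int sock_fd, int fd_to_pass)
{
    char dummy = '*';                    /* must send at least 1 byte */
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union {                              /* properly aligned cmsg buffer */
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    struct msghdr msg = {
        .msg_iov = &iov, .msg_iovlen = 1,
        .msg_control = u.buf, .msg_controllen = sizeof(u.buf),
    };
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);

    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;        /* "I am passing an fd" */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_pass, sizeof(int));

    return sendmsg(sock_fd, &msg, 0);    /* receiver gets a fresh fd */
}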

Sending Data over network inside kernel

I'm writing a driver in the Linux kernel that sends data over the network. Now suppose that my data to be sent (the buffer) is in kernel space. How do I send the data without creating a socket (first of all, is that a good idea at all?)? I'm looking for performance in the code rather than easy coding. And how do I design the receiver end? Without a socket connection, can I get and view the data on the receiving end, and how? And will all of this change (including the performance) if the buffer is in user space (I'll do a copy_from_user if it does :-))?
If you are looking to send data on the network without sockets you'd need to hook into the network drivers and send raw packets through them and filter their incoming packets for those you want to hijack. I don't think the performance benefit will be large enough to warrant this.
I don't even think there are standard hooks for this in the network drivers; I did something similar in the past when implementing a firewall. You could conceivably use the netfilter hooks to attach to the receive side of the network drivers, as in the sketch below.
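A rough sketch of such a receive-side netfilter hook, assuming the ~3.x hook prototype (it changed in later kernel versions):

#include <linux/module.h>
#include <linux/netfilter.h>
#include <linux/netfilter_ipv4.h>
#include <linux/ip.h>

static unsigned int watch_rx(unsigned int hooknum, struct sk_buff *skb,
                             const struct net_device *in,
                             const struct net_device *out,
                             int (*okfn)(struct sk_buff *))
{
    struct iphdr *iph = ip_hdr(skb);

    /* Filter for the packets you want to hijack here; everything
     * else continues up the stack untouched. */
    pr_debug("rx packet, proto %u\n", iph->protocol);
    return NF_ACCEPT;
}

static struct nf_hook_ops rx_ops = {
    .hook     = watch_rx,
    .pf       = PF_INET,
    .hooknum  = NF_INET_PRE_ROUTING,
    .priority = NF_IP_PRI_FIRST,
};

static int __init rxwatch_init(void) { return nf_register_hook(&rx_ops); }
static void __exit rxwatch_exit(void) { nf_unregister_hook(&rx_ops); }

module_init(rxwatch_init);
module_exit(rxwatch_exit);
MODULE_LICENSE("GPL");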
You should probably use netlink, and if you really want to communicate with a distant host (e.g. through TCP/IPv6), use a user-level proxy application for that: the kernel module uses netlink to talk to your user-space proxy, which could use TCP, or even go through ssh or HTTP, to send the data remotely, or store it on disk...
I don't think that having a kernel module directly talking to a distant host makes sense otherwise (e.g. security issues, filtering, routing, iptables ...)
And the real bottleneck is almost always the (physical) network itself: a 1 Gbit Ethernet link is almost always much slower than what a kernel module, or an application, can sustainably produce (and there are also latency issues).

How does an asynchronous socket server work?

I should state that I'm not asking about specific implementation details (yet), but just a general overview of what's going on. I understand the basic concept behind a socket, and need clarification on the process as a whole. My (probably very wrong) understanding is currently this:
A socket is constantly listening for clients that want to connect (in its own thread). When a connection occurs, an event is raised that spawns another thread to perform the connection process. During the connection process the client is assigned its own socket with which to communicate with the server. The server then waits for data from the client, and when data arrives an event is raised which spawns a thread to read the data from a stream into a buffer.
My questions are:
How off is my understanding?
Does each client socket require its own thread to listen for data on?
How is data routed to the correct client socket? Is this something taken care of by the guts of TCP/UDP/kernel?
In this threaded environment, what kind of data is typically being shared, and what are the points of contention?
Any clarifications and additional explanation would be greatly appreciated.
EDIT:
Regarding the question about what data is typically shared and the points of contention, I realize this is more of an implementation detail than a question about the general process of accepting connections and sending/receiving data. I had looked at a couple of implementations (SuperSocket and Kayak) and noticed some synchronization for things like session caches and reusable buffer pools. Feel free to ignore this question. I've appreciated all your feedback.
One thread per connection is bad design (not scalable, overly complex) but unfortunately way too common.
A socket server works more or less like this:
A listening socket is setup to accept connections, and added to a socketset
The socket set is checked for events
If the listening socket has pending connections, new sockets are created by accepting the connections, and then added to the socket set
If a connected socket has events, the relevant IO functions are called
The socket set is checked for events again
This happens in one thread; you can easily handle thousands of connected sockets in a single thread, and there are few valid reasons for making this more complex by introducing threads:
while running
    select on socketset
    for each socket with events
        if socket is listener
            accept new connected socket
            add new socket to socketset
        else if socket is connection
            if event is readable
                read data
                process data
            else if event is writable
                write queued data
            else if event is closed connection
                remove socket from socketset
            end
        end
    done
done
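As a concrete illustration of that loop, here is a minimal single-threaded TCP echo server built on select(); the port is hypothetical and error handling is omitted to keep the sketch short:

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>

int main(void)
{
    int lfd = socket(AF_INET, SOCK_STREAM, 0), maxfd = lfd;
    struct sockaddr_in a = { .sin_family = AF_INET,
                             .sin_addr.s_addr = htonl(INADDR_ANY),
                             .sin_port = htons(7777) };
    fd_set all, rd;

    bind(lfd, (struct sockaddr *)&a, sizeof(a));
    listen(lfd, 16);
    FD_ZERO(&all);
    FD_SET(lfd, &all);

    for (;;) {                                    /* "while running" */
        rd = all;
        select(maxfd + 1, &rd, NULL, NULL, NULL); /* wait for events */

        for (int fd = 0; fd <= maxfd; fd++) {
            if (!FD_ISSET(fd, &rd))
                continue;
            if (fd == lfd) {                      /* listener: accept */
                int cfd = accept(lfd, NULL, NULL);
                FD_SET(cfd, &all);
                if (cfd > maxfd)
                    maxfd = cfd;
            } else {                              /* connection: read */
                char buf[4096];
                ssize_t n = read(fd, buf, sizeof(buf));
                if (n <= 0) {                     /* closed or error */
                    close(fd);
                    FD_CLR(fd, &all);
                } else {
                    write(fd, buf, n);            /* "process data": echo */
                }
            }
        }
    }
}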
The IP stack takes care of all the details of which packets go to which "socket" in which order. Seen from the application's point of view, a socket represents a reliable ordered byte stream (TCP) or an unreliable unordered sequence of packets (UDP).
EDIT: In response to updated question.
I don't know either of the libraries you mention, but on the concepts you mention:
A session cache typically keeps data associated with a client, and can reuse this data for multiple connections. This makes sense when your application logic requires state information, but it's a layer higher than the actual networking end. In the above sample, the session cache would be used by the "process data" part.
Buffer pools are also an easy and often effective optimization for a high-traffic server. The concept is very easy to implement: instead of allocating/deallocating space for the data you read/write, you fetch a preallocated buffer from a pool, use it, then return it to the pool. This avoids the (sometimes relatively expensive) underlying allocation/deallocation mechanisms. This is not directly related to networking; you could just as well use buffer pools for, e.g., something that reads chunks of files and processes them. A sketch follows.
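A minimal single-threaded sketch of such a pool (a real server would add a lock or per-thread pools):

#include <stdlib.h>

#define POOL_SIZE 64
#define BUF_SIZE  4096

static char *pool[POOL_SIZE];
static int   pool_top;

static char *buf_get(void)
{
    if (pool_top > 0)
        return pool[--pool_top];      /* reuse a returned buffer */
    return malloc(BUF_SIZE);          /* pool empty: fall back */
}

static void buf_put(char *buf)
{
    if (pool_top < POOL_SIZE)
        pool[pool_top++] = buf;       /* keep it for the next caller */
    else
        free(buf);                    /* pool full: really release */
}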
How off is my understanding?
Pretty far.
Does each client socket require its own thread to listen for data on?
No.
How is data routed to the correct client socket? Is this something taken care of by the guts of TCP/UDP/kernel?
TCP/IP is a number of layers of protocol. There's no "kernel" to it. It's pieces, each with a separate API to the other pieces.
The IP address is handled in one place.
The port # is handled in another place.
The IP addresses are matched up with MAC addresses to identify a particular host. The port # is what ties a TCP (or UDP) socket to a particular piece of application software.
In this threaded environment, what kind of data is typically being shared, and what are the points of contention?
What threaded environment?
Data sharing? What?
Contention? The physical channel is the number one point of contention. (Ethernet, for example, depends on collision detection.) After that, well, every part of the computer system is a scarce resource shared by multiple applications and is a point of contention.
