网络数据的发送

网络数据的发送

过程概述

说明: 本文使用6.5内核, 涉及网卡时使用intel igb。

使用系统调用(write/send/sendto/sendmsg等)写数据
数据穿过socket子系统，进入socket协议族(protocol family)系统
协议族处理：数据穿过协议层，这一过程(在许多情况下)会将数据(data)转换成数据包(packet)
数据穿过路由层，这会涉及路由缓存和ARP缓存的更新；如果目的MAC不在ARP缓存表中，将触发一次ARP广播来查找MAC地址
穿过协议层，packet到达设备无关层(device agnostic layer)
使用XPS(如果启用)或散列函数选择发送队列
调用网卡驱动的发送函数
数据传送到网卡的qdisc(queue discipline，排队规则)
qdisc会直接发送数据(如果可以)，或者将其放到队列，下次触发NET_TX类型软中断(softirq)的时候再发送
数据从qdisc传送给驱动程序
驱动程序创建所需的DMA映射，以便网卡从RAM读取数据
驱动向网卡发送信号，通知数据可以发送了
网卡从RAM中获取数据并发送
发送完成后，设备触发一个硬中断(IRQ)，表示发送完成
硬中断处理函数被唤醒执行。对许多设备来说，这会触发NET_RX类型的软中断，然后NAPI poll循环开始收包
poll函数会调用驱动程序的相应函数，解除DMA映射，释放数据

系统调用

概要

无论是socket fd的write系统调用，还是send/sendto/sendmsg系统调用，最终都会调用sock_sendmsg()或者sock_sendmsg_nosec(), 而sock_sendmsg()最终也是调用sock_sendmsg_nosec()。

write()

https://elixir.bootlin.com/linux/latest/source/net/socket.c

static ssize_t sock_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
	//...
	res = sock_sendmsg(sock, &msg);
	//...
}

send()与sendto()

https://elixir.bootlin.com/linux/latest/source/net/socket.c

SYSCALL_DEFINE4(send, int, fd, void __user *, buff, size_t, len,
		unsigned int, flags)
{
	return __sys_sendto(fd, buff, len, flags, NULL, 0);
}

https://elixir.bootlin.com/linux/latest/source/net/socket.c

SYSCALL_DEFINE6(sendto, int, fd, void __user *, buff, size_t, len,
		unsigned int, flags, struct sockaddr __user *, addr,
		int, addr_len)
{
	return __sys_sendto(fd, buff, len, flags, addr, addr_len);
}

__sys_sendto()调用sock_sendmsg()。

sendmsg()

https://elixir.bootlin.com/linux/latest/source/net/socket.c

SYSCALL_DEFINE3(sendmsg, int, fd, struct user_msghdr __user *, msg, unsigned int, flags)
{
	return __sys_sendmsg(fd, msg, flags, true);
}

__sys_sendmsg:

____sys_sendmsg
        sock_sendmsg_nosec(sock, msg_sys);
        sock_sendmsg(sock, msg_sys);

sock_sendmsg()调用sock_sendmsg_nosec()。

套接口层

sock_sendmsg_nosec()

如上文所述，所有发送相关的系统调用，最终都会进入sock_sendmsg_nosec函数:

https://elixir.bootlin.com/linux/latest/source/net/socket.c

static inline int sock_sendmsg_nosec(struct socket *sock, struct msghdr *msg)
{
	int ret = INDIRECT_CALL_INET(sock->ops->sendmsg, inet6_sendmsg,
				     inet_sendmsg, sock, msg,
				     msg_data_left(msg));
	//...
}

可见，sock_sendmsg_nosec()最终调用inet6_sendmsg()(IPv6)或inet_sendmsg():
INDIRECT_CALL_INET宏使用INDIRECT_CALL_2，如果sock→ops→sendmsg指向inet6_sendmsg或inet_sendmsg, 则直接调用这两个函数，以减少retpoline技术带来的开销。

inet_sendmsg()

inet_sendmsg():

https://elixir.bootlin.com/linux/latest/source/net/ipv4/af_inet.c

int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
{
	//...
	return INDIRECT_CALL_2(sk->sk_prot->sendmsg, tcp_sendmsg, udp_sendmsg,
			       sk, msg, size);
}

inet6_sendmsg():

https://elixir.bootlin.com/linux/latest/source/net/ipv6/af_inet6.c

int inet6_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
{
	//...
	return INDIRECT_CALL_2(prot->sendmsg, tcp_sendmsg, udpv6_sendmsg,
			       sk, msg, size);
}

可见:
对于tcp, 最终调用tcp_sendmsg()。
对于udp, 最终调用udpv6_sendmsg()/udp_sendmsg()。

传输层

tcp_sendmsg()

tcp_sendmsg按MSS进行分段、添加到发送队列并将用户数据拷贝到分段里面，根据相关判断设置PSH标志，最后调用__tcp_push_pending_frames、tcp_push_one、tcp_push函数发送报文。

tcp_sendmsg(): https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp.c

tcp_sendmsg_locked
    __tcp_push_pending_frames
        tcp_write_xmit
    tcp_push_one
        tcp_write_xmit
    tcp_push
        __tcp_push_pending_frames
            tcp_write_xmit

tcp_write_xmit(): https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_output.c

tcp_transmit_skb
    __tcp_transmit_skb
        ip_queue_xmit: https://elixir.bootlin.com/linux/latest/source/net/ipv4/ip_output.c
            ip_local_out

udp_sendmsg()

udp_sendmsg(): https://elixir.bootlin.com/linux/latest/source/net/ipv4/udp.c

udp_send_skb
    ip_send_skb: https://elixir.bootlin.com/linux/latest/source/net/ipv4/ip_output.c
        ip_local_out

网络层

ip_local_out()

https://elixir.bootlin.com/linux/latest/source/net/ipv4/ip_output.c

int __ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	struct iphdr *iph = ip_hdr(skb);

	iph_set_totlen(iph, skb->len);
	ip_send_check(iph);

	/* if egress device is enslaved to an L3 master device pass the
	 * skb to its handler for processing
	 */
	skb = l3mdev_ip_out(sk, skb);
	if (unlikely(!skb))
		return 0;

	skb->protocol = htons(ETH_P_IP);

	return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT,
		       net, sk, skb, NULL, skb_dst(skb)->dev,
		       dst_output);
}

int ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	int err;

	err = __ip_local_out(net, sk, skb);
	if (likely(err == 1))
		err = dst_output(net, sk, skb);

	return err;
}

ip_local_out调用__ip_local_out，如果返回值为1，则调用路由层dst_output发送数据包。
__ip_local_out(net, sk, skb):
设置IP数据包的长度；
然后调用ip_send_check来计算要写入IP头的校验和；
之后调用nf_hook进入netfilter，其返回值将传递回ip_local_out。如果nf_hook返回1，则表示允许数据包通过，并且调用者应该自己发送数据包。即上文中的ip_local_out检查返回值1，调用dst_output发送数据包。

nf_hook与netfilter:
nf_hook是一个wrapper，它会检查是否有为这个协议族和hook类型(这里分别为 NFPROTO_IPV4和NF_INET_LOCAL_OUT)安装过滤器，然后将返回到IP协议层，避免深入到netfilter或更下面，比如iptables和conntrack。
如果有非常多或者非常复杂的netfilter或iptables规则，那些规则将在触发sendmsg系统调用的用户进程的上下文中执行。如果对这个用户进程设置了CPU亲和性，相应的CPU将花费系统时间(system time)处理出站(outbound)iptables规则。
假设nf_hook返回1，表示调用者(在这种情况下是IP协议层)应该自己发送数据包。

dst_output()

dst代码在内核中实现协议无关的目标缓存。

https://elixir.bootlin.com/linux/latest/source/include/net/dst.h

/* Output packet to network from transport.  */
static inline int dst_output(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	return INDIRECT_CALL_INET(skb_dst(skb)->output,
				  ip6_output, ip_output,
				  net, sk, skb);
}

dst_output()查找关联到skb的dst条目，然后调用output方法，即ip_output方法(IPv6则为ip6_output方法)。

ip_output()

ip_output()调用ip_finish_output():

https://elixir.bootlin.com/linux/latest/source/net/ipv4/ip_output.c

int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	struct net_device *dev = skb_dst(skb)->dev, *indev = skb->dev;

	IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);

	skb->dev = dev;
	skb->protocol = htons(ETH_P_IP);

	return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
			    net, sk, skb, indev, dev,
			    ip_finish_output,
			    !(IPCB(skb)->flags & IPSKB_REROUTED));
}

首先，更新IPSTATS_MIB_OUT统计计数：IP_UPD_PO_STATS宏将更新字节数和包数统计。
之后，设置要发送此skb的设备以及协议。
最后，通过调用NF_HOOK_COND将控制权交给netfilter。NF_HOOK_COND通过检查传入的条件来工作：如果此条件为真，则skb将发送给netfilter。如果netfilter允许包通过，okfn回调函数将被调用，此处okfn是ip_finish_output。

ip_finish_output()

ip_finish_output()调用__ip_finish_output():

https://elixir.bootlin.com/linux/latest/source/net/ipv4/ip_output.c

static int __ip_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	unsigned int mtu;

#if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM)
	/* Policy lookup after SNAT yielded a new policy */
	if (skb_dst(skb)->xfrm) {
		IPCB(skb)->flags |= IPSKB_REROUTED;
		return dst_output(net, sk, skb);
	}
#endif
	mtu = ip_skb_dst_mtu(sk, skb);
	if (skb_is_gso(skb))
		return ip_finish_output_gso(net, sk, skb, mtu);

	if (skb->len > mtu || IPCB(skb)->frag_max_size)
		return ip_fragment(net, sk, skb, mtu, ip_finish_output2);

	return ip_finish_output2(net, sk, skb);
}

如果启用了netfilter和数据包转换(XFRM)，则更新skb标志并调用dst_output将其发回；
否则(通常情况):
1 如果是gso则调用ip_finish_output_gso()
2 如果数据包需要分片则调用ip_fragment()
3 直接调用ip_finish_output2()
这三个函数均会调用ip_finish_output2()。

ip_finish_output2()

https://elixir.bootlin.com/linux/latest/source/net/ipv4/ip_output.c

static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *skb)
{
	//...
	neigh = ip_neigh_for_gw(rt, skb, &is_v6gw);
	if (!IS_ERR(neigh)) {
		//...
		res = neigh_output(neigh, skb, is_v6gw);
		//...
	}
	//...
}

ip_neigh_for_gw()查找邻居缓存，如果未找到会创建一个邻居。
之后调用邻居系统的neigh_output()。

邻居系统

neigh_output()

https://elixir.bootlin.com/linux/latest/source/include/net/neighbour.h

static inline int neigh_output(struct neighbour *n, struct sk_buff *skb,
			       bool skip_cache)
{
	const struct hh_cache *hh = &n->hh;

	/* n->nud_state and hh->hh_len could be changed under us.
	 * () is taking care of the race later.
	 */
	if (!skip_cache &&
	    (READ_ONCE(n->nud_state) & NUD_CONNECTED) &&
	    READ_ONCE(hh->hh_len))
		return neigh_hh_output(hh, skb);

	return n->output(n, skb);
}

如果目标是NUD_CONNECTED且硬件头(hh: hardware header)已被缓存(之前发送过数据并进行了缓存)，将调用neigh_hh_output函数;
否则，调用n→output函数。
先说一下结论，这两种情况，最后都会调用dev_queue_xmit()，将skb发送给网络设备子系统，在它进入设备驱动程序层之前将对其进行更多处理。

struct neighbour的output函数指针

下面介绍n→output()的流程:

如果在邻居缓存中找不到邻居条目，会从ip_finish_output2调用__neigh_create创建一个。当调用__neigh_create时，将分配邻居，其output函数最初设置为neigh_blackhole()。之后，它将根据邻居的状态修改output值以指向适当的发送方法。

https://elixir.bootlin.com/linux/latest/source/net/ipv4/arp.c

static int arp_constructor(struct neighbour *neigh)
{
	//...
	if (!dev->header_ops) {
		neigh->nud_state = NUD_NOARP;
		neigh->ops = &arp_direct_ops;
		neigh->output = neigh_direct_output;
	} else {
	//...
		if (dev->header_ops->cache)
			neigh->ops = &arp_hh_ops;
		else
			neigh->ops = &arp_generic_ops;

		if (neigh->nud_state & NUD_VALID)
			neigh->output = neigh->ops->connected_output;
		else
			neigh->output = neigh->ops->output;
	}
	return 0;
}

可见neigh→output有三种取值:
第一种情况是当neigh→ops指向arp_direct_ops，此时neigh→output指向neigh_direct_output();
另外两种情况是neigh→ops指向arp_hh_ops或者arp_generic_ops，此时neigh→output要么指向neigh_resolve_output()，要么指向neigh_connected_output()。
neigh_direct_output(), neigh_resolve_output()以及neigh_connected_output()，
这三个函数，无一例外，最终都会调用dev_queue_xmit()。

设备系统

dev_queue_xmit()

https://elixir.bootlin.com/linux/latest/source/include/linux/netdevice.h

static inline int dev_queue_xmit(struct sk_buff *skb)
{
	return __dev_queue_xmit(skb, NULL);
}

https://elixir.bootlin.com/linux/latest/source/net/core/dev.c

/**
 * __dev_queue_xmit() - transmit a buffer
 * @skb:	buffer to transmit
 * @sb_dev:	suboordinate device used for L2 forwarding offload
 *
 * Queue a buffer for transmission to a network device. The caller must
 * have set the device and priority and built the buffer before calling
 * this function. The function can be called from an interrupt.
 *
 * When calling this method, interrupts MUST be enabled. This is because
 * the BH enable code must have IRQs enabled so that it will not deadlock.
 *
 * Regardless of the return value, the skb is consumed, so it is currently
 * difficult to retry a send to this method. (You can bump the ref count
 * before sending to hold a reference for retry if you are careful.)
 *
 * Return:
 * * 0				- buffer successfully transmitted
 * * positive qdisc return code	- NET_XMIT_DROP etc.
 * * negative errno		- other errors
 */
int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
{
	struct net_device *dev = skb->dev;
	struct netdev_queue *txq = NULL;
	struct Qdisc *q;
	int rc = -ENOMEM;
	bool again = false;

	//...

	if (!txq)
		txq = netdev_core_pick_tx(dev, skb, sb_dev);

	q = rcu_dereference_bh(txq->qdisc);

	trace_net_dev_queue(skb);
	if (q->enqueue) {
		rc = __dev_xmit_skb(skb, q, dev, txq);
		goto out;
	}

	/* The device has no queue. Common case for software devices:
	 * loopback, all the sorts of tunnels...

	 * Really, it is unlikely that netif_tx_lock protection is necessary
	 * here.  (f.e. loopback and IP tunnels are clean ignoring statistics
	 * counters.)
	 * However, it is possible, that they rely on protection
	 * made by us here.

	 * Check this and shot the lock. It is not prone from deadlocks.
	 *Either shot noqueue qdisc, it is even simpler 8)
	 */
	if (dev->flags & IFF_UP) {
		int cpu = smp_processor_id(); /* ok because BHs are off */

		/* Other cpus might concurrently change txq->xmit_lock_owner
		 * to -1 or to their cpu id, but not to our id.
		 */
		if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
			if (dev_xmit_recursion())
				goto recursion_alert;

			skb = validate_xmit_skb(skb, dev, &again);
			if (!skb)
				goto out;

			HARD_TX_LOCK(dev, txq, cpu);

			if (!netif_xmit_stopped(txq)) {
				dev_xmit_recursion_inc();
				skb = dev_hard_start_xmit(skb, dev, txq, &rc);
				dev_xmit_recursion_dec();
				if (dev_xmit_complete(rc)) {
					HARD_TX_UNLOCK(dev, txq);
					goto out;
				}
			}
			//...
		} else {
			//...
		}
	}

	//...
}

现代网卡是有多个发送队列的，__dev_queue_xmit()首先需要选择一个发送队列到struct netdev_queue *txq，之后调用__dev_xmit_skb():
(其实还有dev_hard_start_xmit()这条非主线路径，后面会分析dev_hard_start_xmit()，这里作一个简要说明:
环回设备和隧道设备没有队列，如果设备当前处于运行状态:
如果发送锁不由此CPU拥有，dev_xmit_recursion()会函数判断per-CPU计数器变量softnet_data.xmit.recursion是否超过XMIT_RECURSION_LIMIT。需要这个计数是因为一个程序可能会在这段代码这里持续发送数据，然后被抢占，调度程序选择另一个程序来运行。如果没有超过，调用dev_hard_start_xmit()。
如果当前CPU是发送锁的拥有者，或者计数超过XMIT_RECURSION_LIMIT，则不进行发送，此时打印告警日志，设置错误码并返回。)

__dev_xmit_skb

https://elixir.bootlin.com/linux/latest/source/net/core/dev.c

static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
				 struct net_device *dev,
				 struct netdev_queue *txq)
{
	spinlock_t *root_lock = qdisc_lock(q);
	struct sk_buff *to_free = NULL;
	bool contended;
	int rc;

	qdisc_calculate_pkt_len(skb, q);

	if (q->flags & TCQ_F_NOLOCK) {
		if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
		    qdisc_run_begin(q)) {
			/* Retest nolock_qdisc_is_empty() within the protection
			 * of q->seqlock to protect from racing with requeuing.
			 */
			if (unlikely(!nolock_qdisc_is_empty(q))) {
				rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
				__qdisc_run(q);
				qdisc_run_end(q);

				goto no_lock_out;
			}

			qdisc_bstats_cpu_update(q, skb);
			if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
			    !nolock_qdisc_is_empty(q))
				__qdisc_run(q);

			qdisc_run_end(q);
			return NET_XMIT_SUCCESS;
		}

		rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
		qdisc_run(q);

no_lock_out:
		if (unlikely(to_free))
			kfree_skb_list_reason(to_free,
					      SKB_DROP_REASON_QDISC_DROP);
		return rc;
	}

	/*
	 * Heuristic to force contended enqueues to serialize on a
	 * separate lock before trying to get qdisc main lock.
	 * This permits qdisc->running owner to get the lock more
	 * often and dequeue packets faster.
	 * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
	 * and then other tasks will only enqueue packets. The packets will be
	 * sent after the qdisc owner is scheduled again. To prevent this
	 * scenario the task always serialize on the lock.
	 */
	contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
	if (unlikely(contended))
		spin_lock(&q->busylock);

	spin_lock(root_lock);
	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
		__qdisc_drop(skb, &to_free);
		rc = NET_XMIT_DROP;
	} else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
		   qdisc_run_begin(q)) {
		/*
		 * This is a work-conserving queue; there are no old skbs
		 * waiting to be sent out; and the qdisc is not running -
		 * xmit the skb directly.
		 */

		qdisc_bstats_update(q, skb);

		if (sch_direct_xmit(skb, q, dev, txq, root_lock, true)) {
			if (unlikely(contended)) {
				spin_unlock(&q->busylock);
				contended = false;
			}
			__qdisc_run(q);
		}

		qdisc_run_end(q);
		rc = NET_XMIT_SUCCESS;
	} else {
		rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
		if (qdisc_run_begin(q)) {
			if (unlikely(contended)) {
				spin_unlock(&q->busylock);
				contended = false;
			}
			__qdisc_run(q);
			qdisc_run_end(q);
		}
	}
	spin_unlock(root_lock);
	if (unlikely(to_free))
		kfree_skb_list_reason(to_free, SKB_DROP_REASON_QDISC_DROP);
	if (unlikely(contended))
		spin_unlock(&q->busylock);
	return rc;
}

关键函数是__qdisc_run()与sch_direct_xmit()。
而__qdisc_run()最终也会调用sch_direct_xmit(): https://elixir.bootlin.com/linux/latest/source/net/sched/sch_generic.c

__qdisc_run()
    qdisc_restart()
        sch_direct_xmit()
            dev_hard_start_xmit()
    __netif_schedule()
        __netif_reschedule()
            raise_softirq_irqoff(NET_TX_SOFTIRQ)
                net_tx_action()
                    qdisc_run()
                        __qdisc_run()//软中断的路径会继续调用__qdisc_run

可见，最终会调用dev_hard_start_xmit()。在分析dev_hard_start_xmit()之前，先看一下__qdisc_run()的实现。

__qdisc_run()

https://elixir.bootlin.com/linux/latest/source/net/sched/sch_generic.c

void __qdisc_run(struct Qdisc *q)
{
	int quota = READ_ONCE(dev_tx_weight);
	int packets;

	while (qdisc_restart(q, &packets)) {
		quota -= packets;
		if (quota <= 0) {
			if (q->flags & TCQ_F_NOLOCK)
				set_bit(__QDISC_STATE_MISSED, &q->state);
			else
				__netif_schedule(q);

			break;
		}
	}
}

quota是一个定额，其值为dev_tx_weight(默认为64)，可以通过系统配置。

只要没有超过这个限额值，且队列里有skb, 就会一直循环调用qdisc_restart()发送数据包;

一旦超出这个限额值:
如果队列支持无锁运行(TCQ_F_NOLOCK)，则设置队列状态为__QDISC_STATE_MISSED，在将来的某个时间点继续处理未完成的数据包;
否则(即队列不支持无锁运行)，则调用__netif_schedule(q)将队列放入网络设备的调度列表中，之后继续在软中断中处理队列中的数据包。

在__netif_schedule()里，会触发NET_TX_SOFTIRQ软中断，但从上述代码中可以看出，并不是每次发送都会调用__netif_schedule()的，也就是说，并不是每次发送都会触发TX软中断，但接收不是，接收会触发NET_RX_SOFTIRQ，
因此通常RX软中断(NET_RX_SOFTIRQ)比TX软中断(NET_TX_SOFTIRQ)要多。

dev_hard_start_xmit()

dev_hard_start_xmit(): https://elixir.bootlin.com/linux/latest/source/net/core/dev.c

xmit_one()
    netdev_start_xmit()
        __netdev_start_xmit()

__netdev_start_xmit():

https://elixir.bootlin.com/linux/latest/source/include/linux/netdevice.h

static inline netdev_tx_t __netdev_start_xmit(const struct net_device_ops *ops,
					      struct sk_buff *skb, struct net_device *dev,
					      bool more)
{
	__this_cpu_write(softnet_data.xmit.more, more);
	return ops->ndo_start_xmit(skb, dev);
}

ops→ndo_start_xmit()即为网卡驱动的发送函数。

网卡驱动

igb_xmit_frame()

以intel igb网卡为例，网卡驱动的发送函数即igb_xmit_frame():

https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/intel/e1000/e1000_main.c

static const struct net_device_ops igb_netdev_ops = {
	//...
	.ndo_start_xmit		= igb_xmit_frame,
	//...
};

igb_xmit_frame()调用igb_xmit_frame_ring()。

igb_xmit_frame_ring()

https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/intel/igb/igb_main.c

netdev_tx_t igb_xmit_frame_ring(struct sk_buff *skb,
				struct igb_ring *tx_ring)
{
	//...
	tso = igb_tso(tx_ring, first, &hdr_len);
	if (tso < 0)
		goto out_drop;
	else if (!tso)
		igb_tx_csum(tx_ring, first);

	if (igb_tx_map(tx_ring, first, hdr_len))
	//...
}

igb_tx_map函数将skb数据映射到网卡可访问的内存DMA区域。

igb_tx_map()

https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/intel/igb/igb_main.c

static int igb_tx_map(struct igb_ring *tx_ring,
		      struct igb_tx_buffer *first,
		      const u8 hdr_len)
{
	struct sk_buff *skb = first->skb;
	struct igb_tx_buffer *tx_buffer;
	union e1000_adv_tx_desc *tx_desc;
	skb_frag_t *frag;
	dma_addr_t dma;
	unsigned int data_len, size;
	u32 tx_flags = first->tx_flags;
	u32 cmd_type = igb_tx_cmd_type(skb, tx_flags);
	u16 i = tx_ring->next_to_use;

	tx_desc = IGB_TX_DESC(tx_ring, i);

	igb_tx_olinfo_status(tx_ring, tx_desc, tx_flags, skb->len - hdr_len);

	size = skb_headlen(skb);
	data_len = skb->data_len;

	dma = dma_map_single(tx_ring->dev, skb->data, size, DMA_TO_DEVICE);

	tx_buffer = first;

	for (frag = &skb_shinfo(skb)->frags[0];; frag++) {
		if (dma_mapping_error(tx_ring->dev, dma))
			goto dma_error;

		/* record length, and DMA address */
		dma_unmap_len_set(tx_buffer, len, size);
		dma_unmap_addr_set(tx_buffer, dma, dma);

		tx_desc->read.buffer_addr = cpu_to_le64(dma);

		while (unlikely(size > IGB_MAX_DATA_PER_TXD)) {
			tx_desc->read.cmd_type_len =
				cpu_to_le32(cmd_type ^ IGB_MAX_DATA_PER_TXD);

			i++;
			tx_desc++;
			if (i == tx_ring->count) {
				tx_desc = IGB_TX_DESC(tx_ring, 0);
				i = 0;
			}
			tx_desc->read.olinfo_status = 0;

			dma += IGB_MAX_DATA_PER_TXD;
			size -= IGB_MAX_DATA_PER_TXD;

			tx_desc->read.buffer_addr = cpu_to_le64(dma);
		}

		if (likely(!data_len))
			break;

		tx_desc->read.cmd_type_len = cpu_to_le32(cmd_type ^ size);

		i++;
		tx_desc++;
		if (i == tx_ring->count) {
			tx_desc = IGB_TX_DESC(tx_ring, 0);
			i = 0;
		}
		tx_desc->read.olinfo_status = 0;

		size = skb_frag_size(frag);
		data_len -= size;

		dma = skb_frag_dma_map(tx_ring->dev, frag, 0,
				       size, DMA_TO_DEVICE);

		tx_buffer = &tx_ring->tx_buffer_info[i];
	}

	/* write last descriptor with RS and EOP bits */
	cmd_type |= size | IGB_TXD_DCMD;
	tx_desc->read.cmd_type_len = cpu_to_le32(cmd_type);

	netdev_tx_sent_queue(txring_txq(tx_ring), first->bytecount);

	/* set the timestamp */
	first->time_stamp = jiffies;

	skb_tx_timestamp(skb);

	/* Force memory writes to complete before letting h/w know there
	 * are new descriptors to fetch.  (Only applicable for weak-ordered
	 * memory model archs, such as IA-64).
	 *
	 * We also need this memory barrier to make certain all of the
	 * status bits have been updated before next_to_watch is written.
	 */
	dma_wmb();

	/* set next_to_watch value indicating a packet is present */
	first->next_to_watch = tx_desc;

	i++;
	if (i == tx_ring->count)
		i = 0;

	tx_ring->next_to_use = i;

	/* Make sure there is space in the ring for the next send. */
	igb_maybe_stop_tx(tx_ring, DESC_NEEDED);

	if (netif_xmit_stopped(txring_txq(tx_ring)) || !netdev_xmit_more()) {
		writel(i, tx_ring->tail);
	}
	return 0;
	//...
}

发送完成

概要

设备发送完成后，设备会触发一个硬中断(IRQ)，表示发送完成。然后，设备驱动程序可以调度一些长时间运行的工作，例如解除DMA映射、释放数据等，具体如何工作则取决于不同的设备。

对于igb驱动程序及其关联设备，发送完成和数据包接收所触发的IRQ是相同的，这也是RX软中断(NET_RX_SOFTIRQ)比TX软中断(NET_TX_SOFTIRQ)要多的另外一个原因。

硬中断处理

igb_msix_ring(): https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/intel/igb/igb_main.c

igb_msix_ring() - drivers/net/ethernet/intel/igb/igb_main.c
    __napi_schedule() - net/core/dev.c
        ____napi_schedule() - net/core/dev.c
            __raise_softirq_irqoff(NET_RX_SOFTIRQ) - net/core/dev.c

__raise_softirq_irqoff(NET_RX_SOFTIRQ)发起NET_RX_SOFTIRQ软中断。

软中断处理

ksoftirqd为软中断处理进程，ksoftirqd收到NET_RX_SOFTIRQ软中断后，执行软中断处理函数net_rx_action()，调用网卡驱动poll()函数(对于igb网卡为igb_poll函数)收包，在poll()函数中调用igb_clean_tx_irq()完成数据包的释放。

run_ksoftirqd() - kernel/softirqd.c
    __do_softirq() - kernel/softirqd.c
        h->action(h) - kernel/softirqd.c
            net_rx_action() - net/core/dev.c
                napi_poll() - net/core/dev.c
                    __napi_poll - net/core/dev.c
                        work = n->poll(n, weight) - net/core/dev.c
                            igb_poll() - drivers/net/ethernet/intel/igb/igb_main.c
                                igb_clean_tx_irq() - drivers/net/ethernet/intel/igb/igb_main.c

igb_clean_tx_irq()释放、回收资源。

参考

https://blog.packagecloud.io/monitoring-tuning-linux-networking-stack-sending-data/
https://arthurchiao.art/blog/tuning-stack-tx-zh/
https://zhuanlan.zhihu.com/p/645347804
https://blog.51cto.com/u_15109148/5470017

过程概述

系统调用

概要

write()

send()与sendto()

sendmsg()

套接口层

sock_sendmsg_nosec()

inet_sendmsg()

传输层

tcp_sendmsg()

udp_sendmsg()

网络层

ip_local_out()

dst_output()

ip_output()

ip_finish_output()

ip_finish_output2()

邻居系统

neigh_output()

struct neighbour的output函数指针

设备系统

dev_queue_xmit()

__dev_xmit_skb

__qdisc_run()

dev_hard_start_xmit()

网卡驱动

igb_xmit_frame()

igb_xmit_frame_ring()

igb_tx_map()

发送完成

概要

硬中断处理

软中断处理

参考

Leave a Reply Cancel reply