网络数据的发送
过程概述
说明: 本文使用6.5内核, 涉及网卡时使用intel igb。
- 使用系统调用(write/send/sendto/sendmsg等)写数据
- 数据穿过socket子系统,进入socket协议族(protocol family)系统
- 协议族处理:数据穿过协议层,这一过程(在许多情况下)会将数据(data)转换成数据包(packet)
- 数据穿过路由层,这会涉及路由缓存和ARP缓存的更新;如果目的MAC不在ARP缓存表中,将触发一次ARP广播来查找MAC地址
- 穿过协议层,packet到达设备无关层(device agnostic layer)
- 使用XPS(如果启用)或散列函数选择发送队列
- 调用网卡驱动的发送函数
- 数据传送到网卡的qdisc(queue discipline,排队规则)
- qdisc会直接发送数据(如果可以),或者将其放到队列,下次触发NET_TX类型软中断(softirq)的时候再发送
- 数据从qdisc传送给驱动程序
- 驱动程序创建所需的DMA映射,以便网卡从RAM读取数据
- 驱动向网卡发送信号,通知数据可以发送了
- 网卡从RAM中获取数据并发送
- 发送完成后,设备触发一个硬中断(IRQ),表示发送完成
- 硬中断处理函数被唤醒执行。对许多设备来说,这会触发NET_RX类型的软中断,然后NAPI poll循环开始收包
- poll函数会调用驱动程序的相应函数,解除DMA映射,释放数据
系统调用
概要
无论是socket fd的write系统调用,还是send/sendto/sendmsg系统调用,最终都会调用sock_sendmsg()或者sock_sendmsg_nosec(), 而sock_sendmsg()最终也是调用sock_sendmsg_nosec()。
write()
static ssize_t sock_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
//...
res = sock_sendmsg(sock, &msg);
//...
}
send()与sendto()
SYSCALL_DEFINE4(send, int, fd, void __user *, buff, size_t, len,
unsigned int, flags)
{
return __sys_sendto(fd, buff, len, flags, NULL, 0);
}
SYSCALL_DEFINE6(sendto, int, fd, void __user *, buff, size_t, len,
unsigned int, flags, struct sockaddr __user *, addr,
int, addr_len)
{
return __sys_sendto(fd, buff, len, flags, addr, addr_len);
}
__sys_sendto()调用sock_sendmsg()。
sendmsg()
SYSCALL_DEFINE3(sendmsg, int, fd, struct user_msghdr __user *, msg, unsigned int, flags)
{
return __sys_sendmsg(fd, msg, flags, true);
}
__sys_sendmsg:
____sys_sendmsg sock_sendmsg_nosec(sock, msg_sys); sock_sendmsg(sock, msg_sys);
sock_sendmsg()调用sock_sendmsg_nosec()。
套接口层
sock_sendmsg_nosec()
如上文所述,所有发送相关的系统调用,最终都会进入sock_sendmsg_nosec函数:
static inline int sock_sendmsg_nosec(struct socket *sock, struct msghdr *msg)
{
int ret = INDIRECT_CALL_INET(sock->ops->sendmsg, inet6_sendmsg,
inet_sendmsg, sock, msg,
msg_data_left(msg));
//...
}
可见,sock_sendmsg_nosec()最终调用inet6_sendmsg()(IPv6)或inet_sendmsg():
INDIRECT_CALL_INET宏使用INDIRECT_CALL_2,如果sock→ops→sendmsg指向inet6_sendmsg或inet_sendmsg, 则直接调用这两个函数,以减少retpoline技术带来的开销。
inet_sendmsg()
inet_sendmsg():
int inet_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
{
//...
return INDIRECT_CALL_2(sk->sk_prot->sendmsg, tcp_sendmsg, udp_sendmsg,
sk, msg, size);
}
inet6_sendmsg():
int inet6_sendmsg(struct socket *sock, struct msghdr *msg, size_t size)
{
//...
return INDIRECT_CALL_2(prot->sendmsg, tcp_sendmsg, udpv6_sendmsg,
sk, msg, size);
}
可见:
对于tcp, 最终调用tcp_sendmsg()。
对于udp, 最终调用udpv6_sendmsg()/udp_sendmsg()。
传输层
tcp_sendmsg()
tcp_sendmsg按MSS进行分段、添加到发送队列并将用户数据拷贝到分段里面,根据相关判断设置PSH标志,最后调用__tcp_push_pending_frames、tcp_push_one、tcp_push函数发送报文。
tcp_sendmsg_locked __tcp_push_pending_frames tcp_write_xmit tcp_push_one tcp_write_xmit tcp_push __tcp_push_pending_frames tcp_write_xmit
tcp_write_xmit(): https://elixir.bootlin.com/linux/latest/source/net/ipv4/tcp_output.c
tcp_transmit_skb __tcp_transmit_skb ip_queue_xmit: https://elixir.bootlin.com/linux/latest/source/net/ipv4/ip_output.c ip_local_out
udp_sendmsg()
udp_send_skb ip_send_skb: https://elixir.bootlin.com/linux/latest/source/net/ipv4/ip_output.c ip_local_out
网络层
ip_local_out()
int __ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
{
struct iphdr *iph = ip_hdr(skb);
iph_set_totlen(iph, skb->len);
ip_send_check(iph);
/* if egress device is enslaved to an L3 master device pass the
* skb to its handler for processing
*/
skb = l3mdev_ip_out(sk, skb);
if (unlikely(!skb))
return 0;
skb->protocol = htons(ETH_P_IP);
return nf_hook(NFPROTO_IPV4, NF_INET_LOCAL_OUT,
net, sk, skb, NULL, skb_dst(skb)->dev,
dst_output);
}
int ip_local_out(struct net *net, struct sock *sk, struct sk_buff *skb)
{
int err;
err = __ip_local_out(net, sk, skb);
if (likely(err == 1))
err = dst_output(net, sk, skb);
return err;
}
ip_local_out调用__ip_local_out,如果返回值为1,则调用路由层dst_output发送数据包。
__ip_local_out(net, sk, skb):
设置IP数据包的长度;
然后调用ip_send_check来计算要写入IP头的校验和;
之后调用nf_hook进入netfilter,其返回值将传递回ip_local_out。如果nf_hook返回1,则表示允许数据包通过,并且调用者应该自己发送数据包。即上文中的ip_local_out检查返回值1,调用dst_output发送数据包。
nf_hook与netfilter:
nf_hook是一个wrapper,它会检查是否有为这个协议族和hook类型(这里分别为 NFPROTO_IPV4和NF_INET_LOCAL_OUT)安装过滤器,然后将返回到IP协议层,避免深入到netfilter或更下面,比如iptables和conntrack。
如果有非常多或者非常复杂的netfilter或iptables规则,那些规则将在触发sendmsg系统调用的用户进程的上下文中执行。如果对这个用户进程设置了CPU亲和性,相应的CPU将花费系统时间(system time)处理出站(outbound)iptables规则。
假设nf_hook返回1,表示调用者(在这种情况下是IP协议层)应该自己发送数据包。
dst_output()
dst代码在内核中实现协议无关的目标缓存。
/* Output packet to network from transport. */
static inline int dst_output(struct net *net, struct sock *sk, struct sk_buff *skb)
{
return INDIRECT_CALL_INET(skb_dst(skb)->output,
ip6_output, ip_output,
net, sk, skb);
}
dst_output()查找关联到skb的dst条目,然后调用output方法,即ip_output方法(IPv6则为ip6_output方法)。
ip_output()
ip_output()调用ip_finish_output():
int ip_output(struct net *net, struct sock *sk, struct sk_buff *skb)
{
struct net_device *dev = skb_dst(skb)->dev, *indev = skb->dev;
IP_UPD_PO_STATS(net, IPSTATS_MIB_OUT, skb->len);
skb->dev = dev;
skb->protocol = htons(ETH_P_IP);
return NF_HOOK_COND(NFPROTO_IPV4, NF_INET_POST_ROUTING,
net, sk, skb, indev, dev,
ip_finish_output,
!(IPCB(skb)->flags & IPSKB_REROUTED));
}
首先,更新IPSTATS_MIB_OUT统计计数:IP_UPD_PO_STATS宏将更新字节数和包数统计。
之后,设置要发送此skb的设备以及协议。
最后,通过调用NF_HOOK_COND将控制权交给netfilter。NF_HOOK_COND通过检查传入的条件来工作:如果此条件为真,则skb将发送给netfilter。如果netfilter允许包通过,okfn回调函数将被调用,此处okfn是ip_finish_output。
ip_finish_output()
ip_finish_output()调用__ip_finish_output():
static int __ip_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb)
{
unsigned int mtu;
#if defined(CONFIG_NETFILTER) && defined(CONFIG_XFRM)
/* Policy lookup after SNAT yielded a new policy */
if (skb_dst(skb)->xfrm) {
IPCB(skb)->flags |= IPSKB_REROUTED;
return dst_output(net, sk, skb);
}
#endif
mtu = ip_skb_dst_mtu(sk, skb);
if (skb_is_gso(skb))
return ip_finish_output_gso(net, sk, skb, mtu);
if (skb->len > mtu || IPCB(skb)->frag_max_size)
return ip_fragment(net, sk, skb, mtu, ip_finish_output2);
return ip_finish_output2(net, sk, skb);
}
如果启用了netfilter和数据包转换(XFRM),则更新skb标志并调用dst_output将其发回;
否则(通常情况):
1 如果是gso则调用ip_finish_output_gso()
2 如果数据包需要分片则调用ip_fragment()
3 直接调用ip_finish_output2()
这三个函数均会调用ip_finish_output2()。
ip_finish_output2()
static int ip_finish_output2(struct net *net, struct sock *sk, struct sk_buff *skb)
{
//...
neigh = ip_neigh_for_gw(rt, skb, &is_v6gw);
if (!IS_ERR(neigh)) {
//...
res = neigh_output(neigh, skb, is_v6gw);
//...
}
//...
}
ip_neigh_for_gw()查找邻居缓存,如果未找到会创建一个邻居。
之后调用邻居系统的neigh_output()。
邻居系统
neigh_output()
static inline int neigh_output(struct neighbour *n, struct sk_buff *skb,
bool skip_cache)
{
const struct hh_cache *hh = &n->hh;
/* n->nud_state and hh->hh_len could be changed under us.
* () is taking care of the race later.
*/
if (!skip_cache &&
(READ_ONCE(n->nud_state) & NUD_CONNECTED) &&
READ_ONCE(hh->hh_len))
return neigh_hh_output(hh, skb);
return n->output(n, skb);
}
如果目标是NUD_CONNECTED且硬件头(hh: hardware header)已被缓存(之前发送过数据并进行了缓存),将调用neigh_hh_output函数;
否则,调用n→output函数。
先说一下结论,这两种情况,最后都会调用dev_queue_xmit(),将skb发送给网络设备子系统,在它进入设备驱动程序层之前将对其进行更多处理。
struct neighbour的output函数指针
下面介绍n→output()的流程:
如果在邻居缓存中找不到邻居条目,会从ip_finish_output2调用__neigh_create创建一个。当调用__neigh_create时,将分配邻居,其output函数最初设置为neigh_blackhole()。之后,它将根据邻居的状态修改output值以指向适当的发送方法。
static int arp_constructor(struct neighbour *neigh)
{
//...
if (!dev->header_ops) {
neigh->nud_state = NUD_NOARP;
neigh->ops = &arp_direct_ops;
neigh->output = neigh_direct_output;
} else {
//...
if (dev->header_ops->cache)
neigh->ops = &arp_hh_ops;
else
neigh->ops = &arp_generic_ops;
if (neigh->nud_state & NUD_VALID)
neigh->output = neigh->ops->connected_output;
else
neigh->output = neigh->ops->output;
}
return 0;
}
可见neigh→output有三种取值:
第一种情况是当neigh→ops指向arp_direct_ops,此时neigh→output指向neigh_direct_output();
另外两种情况是neigh→ops指向arp_hh_ops或者arp_generic_ops,此时neigh→output要么指向neigh_resolve_output(),要么指向neigh_connected_output()。
neigh_direct_output(), neigh_resolve_output()以及neigh_connected_output(),
这三个函数,无一例外,最终都会调用dev_queue_xmit()。
设备系统
dev_queue_xmit()
static inline int dev_queue_xmit(struct sk_buff *skb)
{
return __dev_queue_xmit(skb, NULL);
}
/**
* __dev_queue_xmit() - transmit a buffer
* @skb: buffer to transmit
* @sb_dev: suboordinate device used for L2 forwarding offload
*
* Queue a buffer for transmission to a network device. The caller must
* have set the device and priority and built the buffer before calling
* this function. The function can be called from an interrupt.
*
* When calling this method, interrupts MUST be enabled. This is because
* the BH enable code must have IRQs enabled so that it will not deadlock.
*
* Regardless of the return value, the skb is consumed, so it is currently
* difficult to retry a send to this method. (You can bump the ref count
* before sending to hold a reference for retry if you are careful.)
*
* Return:
* * 0 - buffer successfully transmitted
* * positive qdisc return code - NET_XMIT_DROP etc.
* * negative errno - other errors
*/
int __dev_queue_xmit(struct sk_buff *skb, struct net_device *sb_dev)
{
struct net_device *dev = skb->dev;
struct netdev_queue *txq = NULL;
struct Qdisc *q;
int rc = -ENOMEM;
bool again = false;
//...
if (!txq)
txq = netdev_core_pick_tx(dev, skb, sb_dev);
q = rcu_dereference_bh(txq->qdisc);
trace_net_dev_queue(skb);
if (q->enqueue) {
rc = __dev_xmit_skb(skb, q, dev, txq);
goto out;
}
/* The device has no queue. Common case for software devices:
* loopback, all the sorts of tunnels...
* Really, it is unlikely that netif_tx_lock protection is necessary
* here. (f.e. loopback and IP tunnels are clean ignoring statistics
* counters.)
* However, it is possible, that they rely on protection
* made by us here.
* Check this and shot the lock. It is not prone from deadlocks.
*Either shot noqueue qdisc, it is even simpler 8)
*/
if (dev->flags & IFF_UP) {
int cpu = smp_processor_id(); /* ok because BHs are off */
/* Other cpus might concurrently change txq->xmit_lock_owner
* to -1 or to their cpu id, but not to our id.
*/
if (READ_ONCE(txq->xmit_lock_owner) != cpu) {
if (dev_xmit_recursion())
goto recursion_alert;
skb = validate_xmit_skb(skb, dev, &again);
if (!skb)
goto out;
HARD_TX_LOCK(dev, txq, cpu);
if (!netif_xmit_stopped(txq)) {
dev_xmit_recursion_inc();
skb = dev_hard_start_xmit(skb, dev, txq, &rc);
dev_xmit_recursion_dec();
if (dev_xmit_complete(rc)) {
HARD_TX_UNLOCK(dev, txq);
goto out;
}
}
//...
} else {
//...
}
}
//...
}
现代网卡是有多个发送队列的,__dev_queue_xmit()首先需要选择一个发送队列到struct netdev_queue *txq,之后调用__dev_xmit_skb():
(其实还有dev_hard_start_xmit()这条非主线路径,后面会分析dev_hard_start_xmit(),这里作一个简要说明:
环回设备和隧道设备没有队列,如果设备当前处于运行状态:
如果发送锁不由此CPU拥有,dev_xmit_recursion()会函数判断per-CPU计数器变量softnet_data.xmit.recursion是否超过XMIT_RECURSION_LIMIT。需要这个计数是因为一个程序可能会在这段代码这里持续发送数据,然后被抢占,调度程序选择另一个程序来运行。如果没有超过,调用dev_hard_start_xmit()。
如果当前CPU是发送锁的拥有者,或者计数超过XMIT_RECURSION_LIMIT,则不进行发送,此时打印告警日志,设置错误码并返回。)
__dev_xmit_skb
static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
struct net_device *dev,
struct netdev_queue *txq)
{
spinlock_t *root_lock = qdisc_lock(q);
struct sk_buff *to_free = NULL;
bool contended;
int rc;
qdisc_calculate_pkt_len(skb, q);
if (q->flags & TCQ_F_NOLOCK) {
if (q->flags & TCQ_F_CAN_BYPASS && nolock_qdisc_is_empty(q) &&
qdisc_run_begin(q)) {
/* Retest nolock_qdisc_is_empty() within the protection
* of q->seqlock to protect from racing with requeuing.
*/
if (unlikely(!nolock_qdisc_is_empty(q))) {
rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
__qdisc_run(q);
qdisc_run_end(q);
goto no_lock_out;
}
qdisc_bstats_cpu_update(q, skb);
if (sch_direct_xmit(skb, q, dev, txq, NULL, true) &&
!nolock_qdisc_is_empty(q))
__qdisc_run(q);
qdisc_run_end(q);
return NET_XMIT_SUCCESS;
}
rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
qdisc_run(q);
no_lock_out:
if (unlikely(to_free))
kfree_skb_list_reason(to_free,
SKB_DROP_REASON_QDISC_DROP);
return rc;
}
/*
* Heuristic to force contended enqueues to serialize on a
* separate lock before trying to get qdisc main lock.
* This permits qdisc->running owner to get the lock more
* often and dequeue packets faster.
* On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
* and then other tasks will only enqueue packets. The packets will be
* sent after the qdisc owner is scheduled again. To prevent this
* scenario the task always serialize on the lock.
*/
contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
if (unlikely(contended))
spin_lock(&q->busylock);
spin_lock(root_lock);
if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
__qdisc_drop(skb, &to_free);
rc = NET_XMIT_DROP;
} else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
qdisc_run_begin(q)) {
/*
* This is a work-conserving queue; there are no old skbs
* waiting to be sent out; and the qdisc is not running -
* xmit the skb directly.
*/
qdisc_bstats_update(q, skb);
if (sch_direct_xmit(skb, q, dev, txq, root_lock, true)) {
if (unlikely(contended)) {
spin_unlock(&q->busylock);
contended = false;
}
__qdisc_run(q);
}
qdisc_run_end(q);
rc = NET_XMIT_SUCCESS;
} else {
rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
if (qdisc_run_begin(q)) {
if (unlikely(contended)) {
spin_unlock(&q->busylock);
contended = false;
}
__qdisc_run(q);
qdisc_run_end(q);
}
}
spin_unlock(root_lock);
if (unlikely(to_free))
kfree_skb_list_reason(to_free, SKB_DROP_REASON_QDISC_DROP);
if (unlikely(contended))
spin_unlock(&q->busylock);
return rc;
}
关键函数是__qdisc_run()与sch_direct_xmit()。
而__qdisc_run()最终也会调用sch_direct_xmit(): https://elixir.bootlin.com/linux/latest/source/net/sched/sch_generic.c
__qdisc_run() qdisc_restart() sch_direct_xmit() dev_hard_start_xmit() __netif_schedule() __netif_reschedule() raise_softirq_irqoff(NET_TX_SOFTIRQ) net_tx_action() qdisc_run() __qdisc_run()//软中断的路径会继续调用__qdisc_run
可见,最终会调用dev_hard_start_xmit()。在分析dev_hard_start_xmit()之前,先看一下__qdisc_run()的实现。
__qdisc_run()
void __qdisc_run(struct Qdisc *q)
{
int quota = READ_ONCE(dev_tx_weight);
int packets;
while (qdisc_restart(q, &packets)) {
quota -= packets;
if (quota <= 0) {
if (q->flags & TCQ_F_NOLOCK)
set_bit(__QDISC_STATE_MISSED, &q->state);
else
__netif_schedule(q);
break;
}
}
}
quota是一个定额,其值为dev_tx_weight(默认为64),可以通过系统配置。
只要没有超过这个限额值,且队列里有skb, 就会一直循环调用qdisc_restart()发送数据包;
一旦超出这个限额值:
如果队列支持无锁运行(TCQ_F_NOLOCK),则设置队列状态为__QDISC_STATE_MISSED,在将来的某个时间点继续处理未完成的数据包;
否则(即队列不支持无锁运行),则调用__netif_schedule(q)将队列放入网络设备的调度列表中,之后继续在软中断中处理队列中的数据包。
在__netif_schedule()里,会触发NET_TX_SOFTIRQ软中断,但从上述代码中可以看出,并不是每次发送都会调用__netif_schedule()的,也就是说,并不是每次发送都会触发TX软中断,但接收不是,接收会触发NET_RX_SOFTIRQ,
因此通常RX软中断(NET_RX_SOFTIRQ)比TX软中断(NET_TX_SOFTIRQ)要多。
dev_hard_start_xmit()
dev_hard_start_xmit(): https://elixir.bootlin.com/linux/latest/source/net/core/dev.c
xmit_one() netdev_start_xmit() __netdev_start_xmit()
__netdev_start_xmit():
static inline netdev_tx_t __netdev_start_xmit(const struct net_device_ops *ops,
struct sk_buff *skb, struct net_device *dev,
bool more)
{
__this_cpu_write(softnet_data.xmit.more, more);
return ops->ndo_start_xmit(skb, dev);
}
ops→ndo_start_xmit()即为网卡驱动的发送函数。
网卡驱动
igb_xmit_frame()
以intel igb网卡为例,网卡驱动的发送函数即igb_xmit_frame():
static const struct net_device_ops igb_netdev_ops = {
//...
.ndo_start_xmit = igb_xmit_frame,
//...
};
igb_xmit_frame()调用igb_xmit_frame_ring()。
igb_xmit_frame_ring()
netdev_tx_t igb_xmit_frame_ring(struct sk_buff *skb,
struct igb_ring *tx_ring)
{
//...
tso = igb_tso(tx_ring, first, &hdr_len);
if (tso < 0)
goto out_drop;
else if (!tso)
igb_tx_csum(tx_ring, first);
if (igb_tx_map(tx_ring, first, hdr_len))
//...
}
igb_tx_map函数将skb数据映射到网卡可访问的内存DMA区域。
igb_tx_map()
static int igb_tx_map(struct igb_ring *tx_ring,
struct igb_tx_buffer *first,
const u8 hdr_len)
{
struct sk_buff *skb = first->skb;
struct igb_tx_buffer *tx_buffer;
union e1000_adv_tx_desc *tx_desc;
skb_frag_t *frag;
dma_addr_t dma;
unsigned int data_len, size;
u32 tx_flags = first->tx_flags;
u32 cmd_type = igb_tx_cmd_type(skb, tx_flags);
u16 i = tx_ring->next_to_use;
tx_desc = IGB_TX_DESC(tx_ring, i);
igb_tx_olinfo_status(tx_ring, tx_desc, tx_flags, skb->len - hdr_len);
size = skb_headlen(skb);
data_len = skb->data_len;
dma = dma_map_single(tx_ring->dev, skb->data, size, DMA_TO_DEVICE);
tx_buffer = first;
for (frag = &skb_shinfo(skb)->frags[0];; frag++) {
if (dma_mapping_error(tx_ring->dev, dma))
goto dma_error;
/* record length, and DMA address */
dma_unmap_len_set(tx_buffer, len, size);
dma_unmap_addr_set(tx_buffer, dma, dma);
tx_desc->read.buffer_addr = cpu_to_le64(dma);
while (unlikely(size > IGB_MAX_DATA_PER_TXD)) {
tx_desc->read.cmd_type_len =
cpu_to_le32(cmd_type ^ IGB_MAX_DATA_PER_TXD);
i++;
tx_desc++;
if (i == tx_ring->count) {
tx_desc = IGB_TX_DESC(tx_ring, 0);
i = 0;
}
tx_desc->read.olinfo_status = 0;
dma += IGB_MAX_DATA_PER_TXD;
size -= IGB_MAX_DATA_PER_TXD;
tx_desc->read.buffer_addr = cpu_to_le64(dma);
}
if (likely(!data_len))
break;
tx_desc->read.cmd_type_len = cpu_to_le32(cmd_type ^ size);
i++;
tx_desc++;
if (i == tx_ring->count) {
tx_desc = IGB_TX_DESC(tx_ring, 0);
i = 0;
}
tx_desc->read.olinfo_status = 0;
size = skb_frag_size(frag);
data_len -= size;
dma = skb_frag_dma_map(tx_ring->dev, frag, 0,
size, DMA_TO_DEVICE);
tx_buffer = &tx_ring->tx_buffer_info[i];
}
/* write last descriptor with RS and EOP bits */
cmd_type |= size | IGB_TXD_DCMD;
tx_desc->read.cmd_type_len = cpu_to_le32(cmd_type);
netdev_tx_sent_queue(txring_txq(tx_ring), first->bytecount);
/* set the timestamp */
first->time_stamp = jiffies;
skb_tx_timestamp(skb);
/* Force memory writes to complete before letting h/w know there
* are new descriptors to fetch. (Only applicable for weak-ordered
* memory model archs, such as IA-64).
*
* We also need this memory barrier to make certain all of the
* status bits have been updated before next_to_watch is written.
*/
dma_wmb();
/* set next_to_watch value indicating a packet is present */
first->next_to_watch = tx_desc;
i++;
if (i == tx_ring->count)
i = 0;
tx_ring->next_to_use = i;
/* Make sure there is space in the ring for the next send. */
igb_maybe_stop_tx(tx_ring, DESC_NEEDED);
if (netif_xmit_stopped(txring_txq(tx_ring)) || !netdev_xmit_more()) {
writel(i, tx_ring->tail);
}
return 0;
//...
}
发送完成
概要
设备发送完成后,设备会触发一个硬中断(IRQ),表示发送完成。然后,设备驱动程序可以调度一些长时间运行的工作,例如解除DMA映射、释放数据等,具体如何工作则取决于不同的设备。
对于igb驱动程序及其关联设备,发送完成和数据包接收所触发的IRQ是相同的,这也是RX软中断(NET_RX_SOFTIRQ)比TX软中断(NET_TX_SOFTIRQ)要多的另外一个原因。
硬中断处理
igb_msix_ring(): https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/intel/igb/igb_main.c
igb_msix_ring() - drivers/net/ethernet/intel/igb/igb_main.c __napi_schedule() - net/core/dev.c ____napi_schedule() - net/core/dev.c __raise_softirq_irqoff(NET_RX_SOFTIRQ) - net/core/dev.c
__raise_softirq_irqoff(NET_RX_SOFTIRQ)发起NET_RX_SOFTIRQ软中断。
软中断处理
ksoftirqd为软中断处理进程,ksoftirqd收到NET_RX_SOFTIRQ软中断后,执行软中断处理函数net_rx_action(),调用网卡驱动poll()函数(对于igb网卡为igb_poll函数)收包,在poll()函数中调用igb_clean_tx_irq()完成数据包的释放。
run_ksoftirqd() - kernel/softirqd.c __do_softirq() - kernel/softirqd.c h->action(h) - kernel/softirqd.c net_rx_action() - net/core/dev.c napi_poll() - net/core/dev.c __napi_poll - net/core/dev.c work = n->poll(n, weight) - net/core/dev.c igb_poll() - drivers/net/ethernet/intel/igb/igb_main.c igb_clean_tx_irq() - drivers/net/ethernet/intel/igb/igb_main.c
igb_clean_tx_irq()释放、回收资源。