Re: [PATCH] net: add initial support for AF_XDP network backend
From: Jason Wang
Subject: Re: [PATCH] net: add initial support for AF_XDP network backend
Date: Fri, 30 Jun 2023 15:44:25 +0800
On Wed, Jun 28, 2023 at 7:14 PM Ilya Maximets <i.maximets@ovn.org> wrote:
>
> On 6/28/23 05:27, Jason Wang wrote:
> > On Wed, Jun 28, 2023 at 6:45 AM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>
> >> On 6/27/23 04:54, Jason Wang wrote:
> >>> On Mon, Jun 26, 2023 at 9:17 PM Ilya Maximets <i.maximets@ovn.org> wrote:
> >>>>
> >>>> On 6/26/23 08:32, Jason Wang wrote:
> >>>>> On Sun, Jun 25, 2023 at 3:06 PM Jason Wang <jasowang@redhat.com> wrote:
> >>>>>>
> >>>>>> On Fri, Jun 23, 2023 at 5:58 AM Ilya Maximets <i.maximets@ovn.org>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> AF_XDP is a network socket family that allows communication directly
> >>>>>>> with the network device driver in the kernel, bypassing most or all
> >>>>>>> of the kernel networking stack. In the essence, the technology is
> >>>>>>> pretty similar to netmap. But, unlike netmap, AF_XDP is Linux-native
> >>>>>>> and works with any network interfaces without driver modifications.
> >>>>>>> Unlike vhost-based backends (kernel, user, vdpa), AF_XDP doesn't
> >>>>>>> require access to character devices or unix sockets. Only access to
> >>>>>>> the network interface itself is necessary.
> >>>>>>>
> >>>>>>> This patch implements a network backend that communicates with the
> >>>>>>> kernel by creating an AF_XDP socket. A chunk of userspace memory
> >>>>>>> is shared between QEMU and the host kernel. Four ring buffers (Tx, Rx,
> >>>>>>> Fill and Completion) are placed in that memory along with a pool of
> >>>>>>> memory buffers for the packet data. Data transmission is done by
> >>>>>>> allocating one of the buffers, copying packet data into it and
> >>>>>>> placing the pointer into the Tx ring. After transmission, the device
> >>>>>>> returns the buffer via the Completion ring. On Rx, the device takes
> >>>>>>> a buffer from the pre-populated Fill ring, writes the packet data into
> >>>>>>> it and places the buffer into the Rx ring.
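> >>>>>>>
> >>>>>>> For illustration, the Tx half of that flow looks roughly like the
> >>>>>>> sketch below when written against the libxdp/xsk ring helpers
> >>>>>>> (a minimal sketch only; buffer allocation, error handling and the
> >>>>>>> sendto() wakeup kick are elided):
> >>>>>>>
> >>>>>>>     #include <string.h>
> >>>>>>>     #include <xdp/xsk.h>   /* <bpf/xsk.h> with older libbpf */
> >>>>>>>
> >>>>>>>     static void tx_one(struct xsk_ring_prod *tx, void *umem_area,
> >>>>>>>                        __u64 addr, const void *pkt, __u32 len)
> >>>>>>>     {
> >>>>>>>         __u32 idx;
> >>>>>>>
> >>>>>>>         if (xsk_ring_prod__reserve(tx, 1, &idx) != 1)
> >>>>>>>             return;  /* Tx ring is full. */
> >>>>>>>
> >>>>>>>         /* Copy the packet into the chosen umem buffer... */
> >>>>>>>         memcpy(xsk_umem__get_data(umem_area, addr), pkt, len);
> >>>>>>>
> >>>>>>>         /* ...and place its descriptor into the Tx ring. */
> >>>>>>>         struct xdp_desc *desc = xsk_ring_prod__tx_desc(tx, idx);
> >>>>>>>         desc->addr = addr;
> >>>>>>>         desc->len = len;
> >>>>>>>
> >>>>>>>         xsk_ring_prod__submit(tx, 1);
> >>>>>>>         /* The kernel returns 'addr' via the Completion ring later. */
> >>>>>>>     }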
> >>>>>>>
> >>>>>>> The AF_XDP network backend handles the communication with the host
> >>>>>>> kernel and the network interface, and forwards packets to/from the
> >>>>>>> peer device in QEMU.
> >>>>>>>
> >>>>>>> Usage example:
> >>>>>>>
> >>>>>>> -device virtio-net-pci,netdev=guest1,mac=00:16:35:AF:AA:5C
> >>>>>>> -netdev af-xdp,ifname=ens6f1np1,id=guest1,mode=native,queues=1
> >>>>>>>
> >>>>>>> XDP program bridges the socket with a network interface. It can be
> >>>>>>> attached to the interface in 2 different modes:
> >>>>>>>
> >>>>>>> 1. skb - this mode should work for any interface and doesn't require
> >>>>>>>    driver support, with the caveat of lower performance.
> >>>>>>>
> >>>>>>> 2. native - this does require support from the driver and allows
> >>>>>>>    bypassing skb allocation in the kernel and potentially using
> >>>>>>>    zero-copy while getting packets in/out of userspace.
> >>>>>>>
> >>>>>>> By default, QEMU will try to use native mode and fall back to skb
> >>>>>>> mode. The mode can be forced via the 'mode' option. To force 'copy'
> >>>>>>> even in native mode, use the 'force-copy=on' option. This might be
> >>>>>>> useful if there is some issue with the driver.
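> >>>>>>>
> >>>>>>> As a rough sketch (with illustrative parameter names, not QEMU's
> >>>>>>> actual ones), these options could map onto the xsk socket
> >>>>>>> configuration like this:
> >>>>>>>
> >>>>>>>     #include <stdbool.h>
> >>>>>>>     #include <linux/if_link.h>   /* XDP_FLAGS_*_MODE */
> >>>>>>>     #include <linux/if_xdp.h>    /* XDP_COPY */
> >>>>>>>     #include <xdp/xsk.h>
> >>>>>>>
> >>>>>>>     static void set_mode(struct xsk_socket_config *cfg,
> >>>>>>>                          bool mode_native, bool force_copy)
> >>>>>>>     {
> >>>>>>>         cfg->xdp_flags = mode_native
> >>>>>>>                          ? XDP_FLAGS_DRV_MODE   /* needs driver support */
> >>>>>>>                          : XDP_FLAGS_SKB_MODE;  /* generic, slower */
> >>>>>>>         if (force_copy)
> >>>>>>>             cfg->bind_flags |= XDP_COPY;  /* no zero-copy, even native */
> >>>>>>>     }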
> >>>>>>>
> >>>>>>> The 'queues=N' option allows specifying how many device queues should
> >>>>>>> be opened. Note that all the queues that are not opened are still
> >>>>>>> functional and can receive traffic, but it will not be delivered to
> >>>>>>> QEMU. So, the number of device queues should generally match the
> >>>>>>> QEMU configuration, unless the device is shared with something
> >>>>>>> else and the traffic redirection to the appropriate queues is correctly
> >>>>>>> configured at the device level (e.g. with ethtool -N).
> >>>>>>> The 'start-queue=M' option can be used to specify from which queue id
> >>>>>>> QEMU should start configuring 'N' queues. It might also be necessary
> >>>>>>> to use this option with certain NICs, e.g. MLX5 NICs. See the docs
> >>>>>>> for examples.
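> >>>>>>>
> >>>>>>> In other words (a hypothetical sketch of the idea, not QEMU's actual
> >>>>>>> code), the backend conceptually opens one socket per queue, starting
> >>>>>>> at 'start-queue':
> >>>>>>>
> >>>>>>>     #include <xdp/xsk.h>
> >>>>>>>
> >>>>>>>     static int open_queues(struct xsk_socket **xsk, const char *ifname,
> >>>>>>>                            __u32 start_queue, __u32 n,
> >>>>>>>                            struct xsk_umem **umem,
> >>>>>>>                            struct xsk_ring_cons *rx,
> >>>>>>>                            struct xsk_ring_prod *tx,
> >>>>>>>                            const struct xsk_socket_config *cfg)
> >>>>>>>     {
> >>>>>>>         for (__u32 i = 0; i < n; i++) {
> >>>>>>>             /* Bind socket i to queue 'start_queue + i'. */
> >>>>>>>             int ret = xsk_socket__create(&xsk[i], ifname,
> >>>>>>>                                          start_queue + i, umem[i],
> >>>>>>>                                          &rx[i], &tx[i], cfg);
> >>>>>>>             if (ret)
> >>>>>>>                 return ret;  /* e.g. queue missing or already bound */
> >>>>>>>         }
> >>>>>>>         return 0;
> >>>>>>>     }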
> >>>>>>>
> >>>>>>> In the general case, QEMU will need CAP_NET_ADMIN and CAP_SYS_ADMIN
> >>>>>>> capabilities in order to load default XSK/XDP programs to the
> >>>>>>> network interface and configure BTF maps.
> >>>>>>
> >>>>>> I think you mean "BPF" actually?
> >>>>
> >>>> "BPF Type Format maps" kind of makes some sense, but yes. :)
> >>>>
> >>>>>>
> >>>>>>> It is possible, however,
> >>>>>>> to run only with CAP_NET_RAW.
> >>>>>>
> >>>>>> QEMU often runs without any privileges, so we need to fix that.
> >>>>>>
> >>>>>> I think adding support for SCM_RIGHTS via monitor would be a way to go.
> >>>>
> >>>> I looked through the code and it seems like we can run completely
> >>>> non-privileged as far as the kernel is concerned. We'll need an API
> >>>> modification in libxdp though.
> >>>>
> >>>> The thing is, IIUC, the only syscall that requires CAP_NET_RAW is
> >>>> the base socket creation. Binding and other configuration don't
> >>>> require any privileges. So, we could create a socket externally
> >>>> and pass it to QEMU.
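> >>>>
> >>>> For reference, the fd hand-off itself is just standard SCM_RIGHTS
> >>>> ancillary data over a unix socket, roughly (a minimal sketch, error
> >>>> handling elided):
> >>>>
> >>>>     #include <string.h>
> >>>>     #include <sys/socket.h>
> >>>>
> >>>>     static int send_fd(int unix_sock, int xsk_fd)
> >>>>     {
> >>>>         char data = 'x';
> >>>>         struct iovec iov = { .iov_base = &data, .iov_len = 1 };
> >>>>         char buf[CMSG_SPACE(sizeof(int))];
> >>>>         struct msghdr msg = {
> >>>>             .msg_iov = &iov, .msg_iovlen = 1,
> >>>>             .msg_control = buf, .msg_controllen = sizeof(buf),
> >>>>         };
> >>>>         struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
> >>>>
> >>>>         cmsg->cmsg_level = SOL_SOCKET;
> >>>>         cmsg->cmsg_type = SCM_RIGHTS;  /* pass the fd itself */
> >>>>         cmsg->cmsg_len = CMSG_LEN(sizeof(int));
> >>>>         memcpy(CMSG_DATA(cmsg), &xsk_fd, sizeof(int));
> >>>>
> >>>>         return sendmsg(unix_sock, &msg, 0) < 0 ? -1 : 0;
> >>>>     }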
> >>>
> >>> That's the way TAP works for example.
> >>>
> >>>> Should work, unless it's an oversight from
> >>>> the kernel side that needs to be patched. :) libxdp doesn't have
> >>>> a way to specify an externally created socket today, so we'll need
> >>>> to change that. Should be easy to do though. I can explore.
> >>>
> >>> Please do that.
> >>
> >> I have a prototype:
> >>
> >> https://github.com/igsilya/xdp-tools/commit/db73e90945e3aa5e451ac88c42c83cb9389642d3
> >>
> >> Need to test it out and then submit PR to xdp-tools project.
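> >>
> >> The shape of the API addition would be something like the following
> >> (a purely hypothetical declaration to show the idea; the actual
> >> xdp-tools change may look different):
> >>
> >>     /* Like xsk_socket__create(), but binds the rings to a socket fd
> >>      * that was created (and passed in) by a privileged helper instead
> >>      * of calling socket(AF_XDP, ...) internally. Hypothetical name. */
> >>     int xsk_socket__create_with_fd(struct xsk_socket **xsk, int fd,
> >>                                    const char *ifname, __u32 queue_id,
> >>                                    struct xsk_umem *umem,
> >>                                    struct xsk_ring_cons *rx,
> >>                                    struct xsk_ring_prod *tx,
> >>                                    const struct xsk_socket_config *cfg);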
> >>
> >>>
> >>>>
> >>>> In case the bind syscall actually needs CAP_NET_RAW for some
> >>>> reason, we could change the kernel to allow non-privileged bind
> >>>> by utilizing, e.g., SO_BINDTODEVICE. I.e., let the privileged
> >>>> process bind the socket to a particular device, so QEMU can't
> >>>> bind it to a random one. Might be a good use case to allow even
> >>>> if not strictly necessary.
> >>>
> >>> Yes.
> >>
> >> Will propose something for the kernel as well. We might want something
> >> more granular though, e.g. binding to a queue instead of a device, in
> >> case we want better control in the device sharing scenario.
> >
> > I may be missing something, but the bind is already done at the dev
> > plus queue level right now, isn't it?
>
> Yes, the bind() syscall will bind the socket to the dev+queue. I was
> talking about SO_BINDTODEVICE, which only ties the socket to a particular
> device, but not to a queue.
>
> Assuming SO_BINDTODEVICE is implemented for AF_XDP sockets and
> assuming a privileged process does:
>
> fd = socket(AF_XDP, ...);
> setsockopt(fd, SOL_SOCKET, SO_BINDTODEVICE, <device>);
>
> And sends the fd to a non-privileged process. That non-privileged process
> will be able to call:
>
> bind(fd, <device>, <random queue>);
>
> It will have to use the same device, but it can choose any queue, as
> long as that queue is not already busy with another socket.
>
> So, I was thinking about implementing something like an XDP_BINDTOQID option.
> This way the privileged process may call:
>
> setsockopt(fd, SOL_XDP, XDP_BINDTOQID, <device>, <queue>);
>
> And later the kernel will be able to refuse bind() for any other queue for
> this particular socket.
Not sure; if file descriptor passing works, we probably don't need another way.
>
> Not sure if that is necessary though.
> Since we're allocating the socket in the privileged process, that process
> may add the socket to the BPF map on the correct queue id. This way the
> non-privileged process will not be able to receive any packets from any
> other queue on this socket, even if bound to it. And no other AF_XDP
> socket will be able to bind to that queue either.
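>
> For illustration, that map insertion is a single libbpf call from the
> privileged helper (a sketch; 'xskmap_fd' here stands for the fd of the
> XSKMAP used by the XDP program):
>
>     #include <bpf/bpf.h>
>
>     /* Pin the socket to queue 'qid' in the XDP program's XSKMAP. */
>     static int attach_to_queue(int xskmap_fd, __u32 qid, int xsk_fd)
>     {
>         return bpf_map_update_elem(xskmap_fd, &qid, &xsk_fd, BPF_ANY);
>     }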
I think that's by design, or is there anything wrong with this model?
> So, the
> rogue QEMU will be able to hog one extra queue, but it will not be able
> to intercept any traffic from it, AFAICT. May not be a huge problem
> after all.
>
> SO_BINDTODEVICE would still be nice to have. Especially for cases where
> we give the whole device to one VM.
Then we'd need to use AF_XDP in the guest, which seems to be a different
topic. Alibaba is working on AF_XDP support for virtio-net.
Thanks
>
> Best regards, Ilya Maximets.
>