Bug #6238
AF-XDP crash when closing Suricata while receiving traffic
Description
I am currently doing research on AF_XDP and I encountered a bug that is present in multi-process and multi-threaded configurations of AF_XDP programs. I believe a race condition triggers an IO_PAGEFAULT that crashes the entire OS. This bug can be reproduced using Suricata release 7.0.0-rc1, or any other program that creates multiple user-space processes, each with its own AF_XDP socket.
I have attached some sample code that should be able to reproduce the bug. This code creates n processes, where n is the number of RX queues specified by the user. In my experience, the more processes/RX queues are used, the more likely the crash is to be triggered.
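For readers without the attachment, the one-process-per-RX-queue layout described above can be sketched roughly as follows. This is not the attached sample code: `spawn_workers`, `reap_workers` and `demo_worker` are hypothetical names, and `demo_worker` is only a placeholder for the part that would create an AF_XDP socket on the given queue and run the receive loop.

```c
#define _POSIX_C_SOURCE 200809L  /* expose fork/waitpid under strict -std= modes */
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical placeholder: a real worker would bind an AF_XDP socket to
 * RX queue `queue_id` and receive packets. Here it only returns the id. */
int demo_worker(int queue_id)
{
    return queue_id;
}

/* Fork one child per RX queue. Each child runs worker(queue_id) and exits
 * with its return value. Returns the number of children actually started. */
int spawn_workers(int n_queues, pid_t *pids, int (*worker)(int queue_id))
{
    assert(pids != NULL && worker != NULL);
    int started = 0;
    for (int q = 0; q < n_queues; q++) {
        pid_t pid = fork();
        if (pid < 0)
            break;               /* fork failed: stop spawning */
        if (pid == 0)
            _exit(worker(q));    /* child: never returns to the loop */
        pids[started++] = pid;   /* parent: remember the child */
    }
    return started;
}

/* Wait for every child and record its exit code (-1 if it did not exit
 * normally). Returns the number of children that exited normally. */
int reap_workers(const pid_t *pids, int n, int *exit_codes)
{
    int ok = 0;
    for (int i = 0; i < n; i++) {
        int status = 0;
        if (waitpid(pids[i], &status, 0) == pids[i] && WIFEXITED(status)) {
            exit_codes[i] = WEXITSTATUS(status);
            ok++;
        } else {
            exit_codes[i] = -1;
        }
    }
    return ok;
}
```

Because every child is forked from the same parent, they all stay in the parent's process group unless the program changes that, which matters for the signal behaviour described later in this report.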
To change the number of RX queues, use ethtool to set the number of combined RX queues. The exact command may vary depending on the network card:
sudo ethtool -L <interface> combined <number of RX queues>
Compile the code using make and run it as follows:
sudo -E ./xdp_main.o <interface> <number of child processes> consec
To get the crash to show up, lots of traffic needs to be sent to the network interface. In our experimental setup, a machine using Pktgen is sending traffic to the machine running the AF_XDP code at max line rate. Using Pktgen, vary the IP/MAC addresses of each packet to make sure the packets are somewhat evenly distributed across each RX queue. This may help with reproducing the bug. Also be sure the interface is set to promiscuous mode.
While sending traffic at max line rate, send a SIGINT to the AF_XDP program receiving the traffic to terminate it. An IO_PAGEFAULT does not occur every time, but in our setup it occurs more often than not. Also attached are some screenshots of the terminal and of the output our server gives.
The bug occurs because all child processes share the same controlling terminal (each inherits the same STDIN file descriptor), so every child receives the SIGINT at the same time and they all terminate at once. During this, I believe a race condition is reached where the XDP program is still receiving packets and tries to write them to a UMEM that no longer exists. The order of operations to cause this would be:
1. XDP program looks up AF_XDP socket in XSKS_MAP
2. User space program deletes UMEM and/or AF_XDP socket
3. XDP program tries to write packet to UMEM
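One way to make the simultaneous teardown less likely from user space is to let only the parent react to Ctrl+C and have it stop the workers one at a time. The sketch below illustrates that idea only. It is an assumption on my part that serialising teardown narrows the window in the sequence above; it is not necessarily what the Suricata fix or the kernel fix does. All function names are hypothetical, and the worker loop is a placeholder for the real receive loop.

```c
#define _POSIX_C_SOURCE 200809L  /* expose sigaction, kill, nanosleep */
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Set by the SIGTERM handler and polled by the worker loop. */
static volatile sig_atomic_t stop_requested = 0;

static void on_sigterm(int signo)
{
    (void)signo;
    stop_requested = 1;
}

/* Child side: ignore the terminal's SIGINT so only the parent reacts to
 * Ctrl+C, and turn SIGTERM into a stop flag. Returns 0 on success. */
int worker_install_handlers(void)
{
    struct sigaction sa;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;

    sa.sa_handler = SIG_IGN;
    if (sigaction(SIGINT, &sa, NULL) != 0)
        return -1;

    sa.sa_handler = on_sigterm;
    return sigaction(SIGTERM, &sa, NULL);
}

/* Child side: placeholder for the receive loop. It polls the stop flag
 * every millisecond instead of processing packets. */
void worker_run_until_stopped(void)
{
    const struct timespec tick = { 0, 1000000 };  /* 1 ms */
    while (!stop_requested)
        nanosleep(&tick, NULL);
    /* Hypothetical: a real worker would delete its AF_XDP socket and
     * UMEM here, knowing no sibling is tearing down at the same time. */
}

/* Parent side: ask each worker to stop and wait until it has exited
 * before signalling the next one. Returns how many workers were reaped. */
int stop_workers_in_order(const pid_t *pids, int n)
{
    assert(pids != NULL || n == 0);
    int stopped = 0;
    for (int i = 0; i < n; i++) {
        int status = 0;
        pid_t r;
        if (kill(pids[i], SIGTERM) != 0)
            continue;                     /* worker is already gone */
        do {
            r = waitpid(pids[i], &status, 0);
        } while (r < 0 && errno == EINTR);  /* retry if interrupted */
        if (r == pids[i])
            stopped++;
    }
    return stopped;
}
```

In this arrangement the parent would call `stop_workers_in_order` from its own SIGINT handling path, and each child would call `worker_install_handlers` before creating its socket. The trade-off is a slower shutdown, since total teardown time grows with the number of workers.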
This is being addressed in the Linux kernel, but I have created a fix for Suricata and am implementing it now.
Files
Updated by Joseph Reilly over 1 year ago
- Status changed from In Progress to In Review
Updated by Juliana Fajardini Reichow over 1 year ago
PR for review: https://github.com/OISF/suricata/pull/9315