Bug #1806: Packet loss performance is worse in 3.1RC1 vs 3.0
Status: Closed
Description
On a test server inspecting between 5.5 and 7 Gb/s of traffic, we've upgraded from Suricata v3.0 to v3.1RC1 and noticed that packet loss went from ~25% on v3.0 to ~45% on v3.1RC1. Suricata was built with the same compile options on both versions, which are:
--prefix=/usr --sysconfdir=/etc --localstatedir=/var --enable-gccprotect --disable-gccmarch-native --disable-coccinelle --enable-nfqueue --enable-af-packet --enable-jemalloc --with-libnspr-includes=/usr/include/nspr4 --with-libnss-includes=/usr/include/nss3 --enable-jansson --enable-geoip --enable-luajit --enable-hiredis
The exact same ruleset and config are used with both versions, and the test has been repeated many times over. The config file being used is attached. The test system is a Dell R610 with 4 physical processors of model "Intel(R) Xeon(R) CPU L5506 @ 2.13GHz" (8 with Hyperthreading, which we have enabled) and 24 GB of RAM.
Aside from the noticeable increase in packet loss, we have noticed a drastic reduction in the amount of time Suricata takes to start inspecting traffic after the process starts: from ~60 seconds down to less than 3 seconds. It should also be noted that Suricata is running within a Docker container for both 3.0 and 3.1RC1, each based on the same CentOS 7.2 base image.
Updated by Victor Julien over 8 years ago
Could you attach the startup logs from 3.0 with -v and 3.1RC1 with -vvv?
The shorter startup time is expected. It's the result of a rewrite of part of the detection engine.
Updated by Chris Beverly over 8 years ago
- File startup-3.0.log added
- File startup-3.1RC1.log added
Yeah, we saw the shorter startup time in the changelog (which is awesome!). Attached are the two logs, one for each version.
Updated by Victor Julien over 8 years ago
Couple things I noticed so far:
- please see if you can address this warning:
13/6/2016 -- 15:49:29 - <Warning> - [ERRCODE: SC_WARN_POOR_RULE(276)] - rule 8000000: SYN-only to port(s) 13337:13337 w/o direction specified, disabling for toclient direction
- it seems in 3.1RC1 you're using AF_PACKET v3. Please try v2 (the default in Suricata 3.0) by adding "tpacket-v3: no" to your af-packet config (see the sketch after this list)
- you're using very few rules (149), so I wouldn't expect the detection engine to have a large effect on the perf
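For reference, a minimal sketch of where that option would go in suricata.yaml (the interface name bond1 is taken from the iface-stat output later in this thread; all other af-packet settings omitted):

af-packet:
  - interface: bond1
    # force TPACKET_V2; v3 became the default in 3.1RC1
    tpacket-v3: no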
Updated by Chris Beverly over 8 years ago
Turning off tpacket-v3 did make a noticeable impact on the rate of dropped packets, but there's still definitely a considerable difference between v3.0 and 3.1RC1. Traffic rates have changed a bit from the initial testing (different time of day), and here are the numbers out of "suricatasc iface-stat" for each of the versions (after letting the process run for ~300M packets):
v3.0:
iface-stat bond1
Success:
{
"drop": 140401761,
"invalid-checksums": 0,
"pkts": 300891030
}
v3.1RC1:
iface-stat bond1
Success:
{
"drop": 169177519,
"invalid-checksums": 0,
"pkts": 303806025
}
This puts v3.0 at around 46% of packets dropped vs total, while 3.1RC1 is up closer to 56%.
Not sure if it helps or not, but just about every rule we have enabled is a thresholding rule, which looks a lot like the following:
alert tcp $EXTERNAL_NET any -> $DSTVAR any (msg:"DDoS syn_by_dst [dst-protect]"; flow:stateless; flags:S; threshold:type both, track by_dst, count 15000, seconds 5; classtype:attempted-dos; sid:3000000; rev:1;)
These are the rules that caused us some very long startup times in 3.0, but 3.1RC1 definitely starts up much faster with these rules in place. I've previously confirmed that with these rules disabled in 3.0, it would start up in under 5 seconds as opposed to 60 to 120 seconds.
Updated by Victor Julien over 8 years ago
Are you able to (privately) share your ruleset?
Updated by Chris Beverly over 8 years ago
Absolutely, just need to know how and where to send it.
Updated by Victor Julien over 8 years ago
Can you email me at victor@inliniac.net?
Updated by Peter Manev over 8 years ago
I was looking at the previously provided suricata.yaml. I noticed two things that may have affected performance:
1.
In the af-packet section you have both
ring-size: 300000
# On busy system, this could help to set it to yes to recover from a packet drop
# phase. This will result in some packets (at max a ring flush) being non treated.
use-emergency-flush: no
# recv buffer size, increase value could improve performance
# buffer-size: 32768
buffer-size: 32768
I think with kernel > 3.2 you should have only ring-size enabled, not both at the same time.
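A minimal sketch of what that section would look like with only ring-size active (values from the attached config, buffer-size commented out; a suggestion to try, not a tested setting):

af-packet:
  - interface: bond1
    ring-size: 300000
    use-emergency-flush: no
    # buffer-size: 32768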
2.
Some yaml config adjustments - for example, the stream.reassembly.segments section in the provided yaml differs from the default style -
https://redmine.openinfosecfoundation.org/projects/suricata/repository/revisions/master/entry/suricata.yaml.in#L1145
I have seen a few changes like those that have a "silent" unintended effect on the configuration and hence performance.
Updated by Chris Beverly over 8 years ago
Changing those settings didn't seem to make any discernible difference in performance. Here are the packet drop rates for each version with and without those config options commented out, no other changes to the suricata.yaml config:
3.0 - 41.0% loss as is, 40.7% with buffer-size and the segment section commented out
3.1RC1 - 52.6% loss as is, 51.9% with buffer-size and the segment section commented out
While there seems to be a minor difference in each version with the config change, this may also vary with traffic load. There is still a very noticeable difference in performance between the two versions, both with and without the config options you cited.
Victor provided me with an update via email regarding the worse performance being due to the rules we're mostly utilizing (packet threshold by destination) and the detection engine rewrite in 3.1RC1. I'm currently waiting to hear more back on that.
Updated by Peter Manev over 8 years ago
Thanks for trying it out.
How do you run your tests exactly btw? What is your suricata start line?
Updated by Chris Beverly over 8 years ago
The tests are literally just starting up Suricata and waiting for the engine to receive a total of ~300 million packets, then doing the math on dropped vs total packets. Our startup line is:
/usr/bin/suricata -c /etc/suricata/suricata.yaml --af-packet
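For reference, a sketch of the drop-rate math, assuming suricatasc's -c option is used to query the interface counters non-interactively and that python3 is available for the JSON parsing:

suricatasc -c "iface-stat bond1" | python3 -c '
import json, sys
# the unix-socket answer wraps the counters in a "message" object
msg = json.load(sys.stdin)["message"]
print("drop rate: %.1f%%" % (100.0 * msg["drop"] / msg["pkts"]))
'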
Updated by Peter Manev over 8 years ago
Could you try out something -
Change the value of max-pending-packets from the current 4096 to 65534 and run the tests again with 3.0 and 3.1RC1 (v2/v3 afpacket), and see how that affects the results when inspecting at the 5.5-7 Gb/s speeds you mention initially.
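That is, a one-line change to the top-level setting in suricata.yaml:

# number of packets the engine may have in flight at once
max-pending-packets: 65534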
Updated by Chris Beverly over 8 years ago
It doesn't seem to have made any difference. I retested both versions after making the change and let the process run to around 300 million packets before calculating the drop rate. Traffic rates are currently at 7 Gb/s; the drop rates were 44.2% on 3.0 and 53.8% on 3.1RC1.
Updated by Victor Julien over 8 years ago
- Status changed from New to Assigned
- Assignee set to Victor Julien
- Target version set to 70
I'm hoping to address this for 3.2. I have some ideas about adding more prefilter steps/engines.
Updated by Alexander Gozman about 8 years ago
Actually, 3.0.1 also seems to suffer from performance loss compared to 2.1b4. We've seen a huge performance drop in afpacket inline mode with a large set of signatures (the full Emerging Threats Pro set, ~26000 sigs) with mpm-algo set to 'ac' and sgh-mpm-context set to 'auto'. Without signatures everything worked fine with no performance decrease. Playing with max-pending-packets or af-packet params hasn't changed anything.
Probably the roots of evil are the same here ;)
PS: I can provide the ruleset and/or config file and more detailed description privately, if needed.
Updated by Victor Julien about 8 years ago
@Alex Lasky I'd be surprised if it's the same issue. Chris is running a highly specialized ruleset, and I've identified the cause for the slow down for that set between 3.0 and 3.1. Whatever you are seeing must have happened much earlier during the 3.0 development. Would be interesting if you can pinpoint what it is. Perhaps a git bisect would be useful.
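For reference, a hedged sketch of such a bisect, assuming the release tags in the suricata git repo are named suricata-2.1beta4 (known good) and suricata-3.0.1 (known bad); the build-and-measure step at each revision is your own throughput test:

git clone https://github.com/OISF/suricata.git && cd suricata
git bisect start suricata-3.0.1 suricata-2.1beta4
# at each step: build, run the traffic test, then mark the revision
git bisect good     # or: git bisect bad
git bisect reset    # when finished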
Updated by Victor Julien about 7 years ago
- Status changed from Assigned to Closed
- Assignee deleted (Victor Julien)
- Target version deleted (70)
The more generic prefilter engines that were added in 3.2 should address this.
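For context, 3.2 also added a per-rule prefilter keyword; a hedged sketch of the thread's example rule opting a non-MPM keyword (flags:S) into prefiltering, assuming the 3.2 syntax:

alert tcp $EXTERNAL_NET any -> $DSTVAR any (msg:"DDoS syn_by_dst [dst-protect]"; flow:stateless; flags:S; prefilter; threshold:type both, track by_dst, count 15000, seconds 5; classtype:attempted-dos; sid:3000000; rev:1;)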