Bug #6619
closedProfiling takes much longer to run than it used to
Description
While trying to profile for something else, I noticed that profiling in 7.x.x+ takes longer compared to 6.x.x builds.
I think the source is this commit
https://github.com/OISF/suricata/commit/b591813b8690753ee4a678f8b10ff1928de1fe15
#ifdef PROFILE_RULES
/* aggregate statistics */
if (SCTIME_SECS(p->ts) != det_ctx->rule_perf_last_sync) {
SCProfilingRuleThreatAggregate(det_ctx);
det_ctx->rule_perf_last_sync = SCTIME_SECS(p->ts);
}
#endif
This was done in the QA lab but should be reproducible by reading a PCAP with Suricata.
The slowdown can be anywhere from 2-10x longer. Runs were 2-3x longer when done with our CI but almost 10x longer when ssh'd and running in console, so I'm not what factors were at play there. ssh may be inherently slower as it is due to sending text back over the wire.
Files
Updated by Simeon Miteff 11 months ago
Corey Thomas wrote:
While trying to profile for something else, I noticed that profiling in 7.x.x+ takes longer compared to 6.x.x builds.
I think the source is this commit
https://github.com/OISF/suricata/commit/b591813b8690753ee4a678f8b10ff1928de1fe15This was done in the QA lab but should be reproducible by reading a PCAP with Suricata.
I've had a slowness problem processing a large archive of PCAPs that also affects 6.x.x builds (and possibly older). I found that by reducing the number of threads from the maximum for the (4-socket / 176 hw thread) machine I'm testing on to 4 worker threads achieved the best (but still terrible) performance.
Last week Jeff helped me trace the problem to rules profiling (which I can't turn off because that is the purpose of the PCAP processing). Yesterday I used VTune threading analysis to discover that with 176 worker threads, they are spending almost all their time waiting for the lock in `SCProfilingRuleThreatAggregate()`.
It gets called 33,568 times in 44.712s (every 1.3ms) whereas 1s/176 threads should be every ~5.8ms, so I'm not confident that the 1hz logic in the commit you reference is working as intended.
OTOH, it could be working but with many workers the order-of-10ms frequency drives up the lock contention to result in pathelogical performance. Either way, it might explain why I didn't see any change in 7.x.x+ like you did.
I haven't fully grokked the code but I would personally be happy with per-worker thread rules profiling output (similar to threaded EVE logging) and aggregate them externally (since I already do this over multiple parallel suricata executions).
Updated by Victor Julien 11 months ago
Looks like this is related to how time is managed. In the previous setup, on a fixed interval of wall time the threads would do their profiling processing. In b591813b8690753ee4a678f8b10ff1928de1fe15
this time management was shifted to the packet threads, which use the packet time. In pcaps, packet time can move much faster than wall time, so in these cases the processing is triggered (way) more often.
Updated by Victor Julien 11 months ago
- Status changed from New to In Review
- Assignee changed from OISF Dev to Victor Julien
- Target version changed from TBD to 8.0.0-beta1
- Label Needs backport to 7.0 added
Updated by Simeon Miteff 10 months ago
I've tested this code and I'm seeing a 5x speedup processing PCAPs, which is excellent, but I think rules profiling is now broken, I'm getting empty output like this:
$ cat rule_perf.log {"timestamp":"2024-01-09T02:53:41.345566+0000","sort":"ticks","rules":[]} {"timestamp":"2024-01-09T02:53:41.345590+0000","sort":"average ticks","rules":[]} {"timestamp":"2024-01-09T02:53:41.345593+0000","sort":"average ticks (match)","rules":[]} {"timestamp":"2024-01-09T02:53:41.345595+0000","sort":"average ticks (no match)","rules":[]} {"timestamp":"2024-01-09T02:53:41.345597+0000","sort":"number of checks","rules":[]} {"timestamp":"2024-01-09T02:53:41.345598+0000","sort":"number of matches","rules":[]} {"timestamp":"2024-01-09T02:53:41.345600+0000","sort":"max ticks","rules":[]}
Updated by Victor Julien 10 months ago
- Status changed from In Review to Resolved
https://github.com/OISF/suricata/pull/10134/commits/bcb2b50cfc34430e0e91dea781c90d2259ef8f0d
@Simeon Miteff it works here, can you describe your test?
Updated by Simeon Miteff 10 months ago
Victor Julien wrote in #note-7:
https://github.com/OISF/suricata/pull/10134/commits/bcb2b50cfc34430e0e91dea781c90d2259ef8f0d
@Simeon Miteff it works here, can you describe your test?
Built from commit 3df238d541793e8f55185e3cbd85f67bda1484f6 (HEAD, victor/detect-cleanups/v14)
with:
./configure --enable-profiling --enable-profiling-locks --enable-profiling-rules --enable-debug
Build info:
This is Suricata version 7.0.0-dev (b06c0579f 2022-09-26) Features: PCAP_SET_BUFF AF_PACKET HAVE_PACKET_FANOUT LIBCAP_NG HAVE_HTP_URI_NORMALIZE_HOOK PCRE_JIT HAVE_NSS HTTP2_DECOMPRESSION HAVE_LIBJANSSON PROFILING TLS TLS_C11 RUST SIMD support: SSE_4_2 SSE_4_1 SSE_3 Atomic intrinsics: 1 2 4 8 16 byte(s) 64-bits, Little-endian architecture GCC version 11.2.0, C version 201112 compiled with _FORTIFY_SOURCE=2 L1 cache line size (CLS)=64 thread local storage method: _Thread_local compiled with LibHTP v0.5.40, linked against LibHTP v0.5.45 Suricata Configuration: AF_PACKET support: yes DPDK support: no eBPF support: no XDP support: no PF_RING support: no NFQueue support: no NFLOG support: no IPFW support: no Netmap support: no DAG enabled: no Napatech enabled: no WinDivert enabled: no Unix socket enabled: yes Detection enabled: yes Libmagic support: no libjansson support: yes hiredis support: no hiredis async with libevent: no PCRE jit: yes LUA support: no libluajit: no GeoIP2 support: no Non-bundled htp: yes Hyperscan support: no Libnet support: no liblz4 support: yes Landlock support: yes Rust support: yes Rust strict mode: no Rust compiler path: /home/simeon/.cargo/bin/rustc Rust compiler version: rustc 1.61.0 (fe5b13d68 2022-05-18) Cargo path: /home/simeon/.cargo/bin/cargo Cargo version: cargo 1.61.0 (a028ae4 2022-04-29) Python support: yes Python path: /usr/bin/python3 Install suricatactl: yes Install suricatasc: yes Install suricata-update: no, not bundled Profiling enabled: yes Profiling locks enabled: no Plugin support (experimental): yes Development settings: Coccinelle / spatch: no Unit tests enabled: no Debug output enabled: no Debug validation enabled: no Fuzz targets enabled: no Generic build parameters: Installation prefix: /home/simeon/opt/suricata Configuration directory: /home/simeon/opt/suricata/etc/suricata/ Log directory: /home/simeon/opt/suricata/var/log/suricata/ --prefix /home/simeon/opt/suricata --sysconfdir /home/simeon/opt/suricata/etc --localstatedir /home/simeon/opt/suricata/var --datarootdir /home/simeon/opt/suricata/share Host: x86_64-pc-linux-gnu Compiler: gcc (exec name) / g++ (real) GCC Protect enabled: no GCC march native enabled: yes GCC Profile enabled: no Position Independent Executable enabled: no CFLAGS -g -O2 -fPIC -std=c11 -march=native -I${srcdir}/../rust/gen -I${srcdir}/../rust/dist PCAP_CFLAGS -I/usr/include SECCFLAGS
Configuration is the ~/opt/suricata/etc/suricata.yaml
from the build above (unmodified) which has rules profiling enabled.
Command line: suricata -vvv -k none --pcap-file-recursive -r ./pcap -S test.rules -c suricata.yaml
The test.rules
ruleset:
alert icmp any any -> any any (msg:"Test"; sid:1000000;)
Input PCAP put into the pcap/
directory is attached.
The single rule matches and triggers an alert:
$ head -n1 eve.json | jq . { "timestamp": "2012-04-12T05:49:23.977955+1000", "flow_id": 452525811821603, "pcap_cnt": 1, "event_type": "alert", "src_ip": "10.0.0.1", "src_port": 0, "dest_ip": "74.125.225.99", "dest_port": 0, "proto": "ICMP", "icmp_type": 8, "icmp_code": 0, "pkt_src": "wire/pcap", "alert": { "action": "allowed", "gid": 1, "signature_id": 1000000, "rev": 0, "signature": "Test", "category": "", "severity": 3 }, "flow": { "pkts_toserver": 1, "pkts_toclient": 0, "bytes_toserver": 98, "bytes_toclient": 0, "start": "2012-04-12T05:49:23.977955+1000" } }
Here are the last few lines of log output:
[1369548] 10/1/2024 -- 09:39:01 - (source-pcap-file.c:386) <Notice> (ReceivePcapFileThreadExitStats) -- Pcap-file module read 1 files, 1 packets, 98 bytes [1369547] 10/1/2024 -- 09:39:01 - (tmqh-flow.c:213) <Perf> (TmqhOutputFlowFreeCtx) -- AutoFP - Total flow handler queues - 12 [1369547] 10/1/2024 -- 09:39:01 - (counters.c:871) <Info> (StatsLogSummary) -- Alerts: 1 [1369547] 10/1/2024 -- 09:39:01 - (ippair.c:296) <Perf> (IPPairPrintStats) -- ippair memory usage: 414144 bytes, maximum: 16777216 [1369547] 10/1/2024 -- 09:39:01 - (util-profiling.c:297) <Perf> (SCProfilingDump) -- Done dumping profiling data. [1369547] 10/1/2024 -- 09:39:01 - (host.c:299) <Perf> (HostPrintStats) -- host memory usage: 398144 bytes, maximum: 33554432 [1369547] 10/1/2024 -- 09:39:01 - (util-profiling-rules.c:453) <Perf> (SCProfilingRuleDump) -- Dumping profiling data for 1 rules. [1369547] 10/1/2024 -- 09:39:01 - (util-profiling-rules.c:535) <Perf> (SCProfilingRuleDump) -- Done dumping profiling data. [1369547] 10/1/2024 -- 09:39:01 - (util-profiling-keywords.c:213) <Perf> (SCProfilingKeywordDump) -- Done dumping keyword profiling data. [1369547] 10/1/2024 -- 09:39:01 - (util-profiling-rulegroups.c:257) <Perf> (SCProfilingSghDump) -- Done dumping rulegroup profiling data. [1369547] 10/1/2024 -- 09:39:01 - (detect-engine-build.c:1780) <Info> (SigAddressCleanupStage1) -- cleaning up signature grouping structure... complete
And finally rule_perf.log:
{"timestamp":"2024-01-10T09:39:01.647509+1000","sort":"ticks","rules":[]} {"timestamp":"2024-01-10T09:39:01.647524+1000","sort":"average ticks","rules":[]} {"timestamp":"2024-01-10T09:39:01.647529+1000","sort":"average ticks (match)","rules":[]} {"timestamp":"2024-01-10T09:39:01.647532+1000","sort":"average ticks (no match)","rules":[]} {"timestamp":"2024-01-10T09:39:01.647537+1000","sort":"number of checks","rules":[]} {"timestamp":"2024-01-10T09:39:01.647540+1000","sort":"number of matches","rules":[]} {"timestamp":"2024-01-10T09:39:01.647543+1000","sort":"max ticks","rules":[]}I did some poking around with GDB, and I can confirm:
- The aggregate/merge functions do run both from the code you changed and on thread shutdown, so it appears the new wall-clock timing based logic is OK
profiling_rules_enabled
is on, so I think it should not be empty- the rules profiling counters don't appear to be incremented, as if the
RULE_PROFILING_START/END
macro isn't being called
Unfortunately I couldn't get to the bottom of it with GDB.
Before I reproduced it with this minimal setup I ran the code on 1/2 TB of PCAPs with a full ETPro ruleset, and got the same result so I don't think it is ruleset/traffic related.
Updated by Victor Julien 10 months ago
- Related to Feature #6550: profiling/rules: allow enabling profiling for pcap file runs added
Updated by Jeff Lucovsky 10 months ago
@Simeon Miteff If might be related to issue 6550 (linked to this ticket). Can you check if adding that PR's changes help?
Updated by Simeon Miteff 10 months ago
Jeff Lucovsky wrote in #note-10:
@Simeon Miteff If might be related to issue 6550 (linked to this ticket). Can you check if adding that PR's changes help?
Yes this looked super promising but I tried it and get the same result. I left a comment on the PR here: https://github.com/OISF/suricata/pull/9755#issuecomment-1892891926
Updated by Simeon Miteff 9 months ago
FYI after straightening out some issues with the way I was configuring the build for profiling I was able to make the combination of Victor and Eric's branches work for me.