Documentation #6552
closeddoc: add tcp timeout fix to upgrade guide
Description
I have bisected two git commits that have caused large increases in memory use.
This one https://github.com/OISF/suricata/commit/f2e2576ff4 causes an increase in http.memuse.
4.0.6:
http.memuse | Total | 61928
7.0.0
http.memuse | Total | 2698429
This one https://github.com/OISF/suricata/commit/1dac2467c5 causes a massive increase in tcp.reassembly_memuse such that memory use is almost 6x and the reassembly.memcap is hit causing traffic drops (IPS mode) and throughput reduction.
4.0.6
tcp.reassembly_memuse | Total | 11128832
7.0.0
tcp.reassembly_memuse | Total | 63608832
I have not been able to figure out why this increased so much hence I am making this ticket for help.
For these stats I used trex with 1 IP.
sudo ./t-rex-64 --astf -f astf/http_https.py -m 100 -d 230400 -l 100
Files
Updated by Victor Julien 11 months ago
- Status changed from New to Feedback
- Assignee changed from OISF Dev to Thomas Winter
It would be good to get a better test case, ideally pcap based. Also, you could use tools like valgrinds massif to try to analyze where the memory use is exactly.
Updated by Thomas Winter 11 months ago
- File http_https_suri5.pcap http_https_suri5.pcap added
- File massif.out.2298 massif.out.2298 added
- File massif.out.3238 massif.out.3238 added
I've attached massif.out.3238 which is 4.0.6 with ReassembleRealloc at 17.1MB peak.
I've attached massif.out.2298 which is 7.0.0 with StreamTcpReassembleRealloc at 61.7MB peak.
Both were captured with `astf/http_https.py -m 100`.
After flows are closed/timedout, memuse returns to baseline so it is not a leak.
Here are some notes on running trex.
#Download and extract wget --no-cache --no-check-certificate https://trex-tgn.cisco.com/trex/release/latest tar -xzvf latest cd v3.04/ #Create a trex config file cat /etc/trex_cfg.yaml - port_limit : 2 version : 2 #List of interfaces. Change to suit your setup. Use ./dpdk_setup_ports.py -s to see available options #interfaces : ["enp3s0","enp4s0"] interfaces : ["br10","br20"] port_info : # Port IPs. Change to suit your needs. In case of loopback, you can leave as is. - ip : 192.168.10.100 #default_gw : 192.168.10.15 default_gw : 192.168.10.16 - ip : 192.168.20.100 #default_gw : 192.168.20.15 default_gw : 192.168.20.16 #Update traffic to use single client/server IP diff astf/http_https.py.orig astf/http_https.py 15,16c15,16 < ip_gen_c = ASTFIPGenDist(ip_range=["16.0.0.0", "16.0.0.255"], distribution="seq") < ip_gen_s = ASTFIPGenDist(ip_range=["48.0.0.0", "48.0.255.255"], distribution="seq") --- > ip_gen_c = ASTFIPGenDist(ip_range=["192.168.10.100", "192.168.10.100"], distribution="seq") > ip_gen_s = ASTFIPGenDist(ip_range=["192.168.20.100", "192.168.20.100"], distribution="seq") #IPs are on the local interfaces, NICs are connected to the device running suricata acting as router/gateway ip addr 8: br10: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 inet 192.168.10.100/24 brd 192.168.10.255 scope global br10 9: br20: <BROADCAST,MULTICAST,PROMISC,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 inet 192.168.20.100/24 brd 192.168.20.255 scope global br20 #Run the test sudo ./t-rex-64 --astf -f astf/http_https.py -m 250
I've attached a packet capture of a trex run with -m 10 for 10 seconds. Packet capture would be too large otherwise, you would need to set a very low reassembly memcap to see it run out with this one. I suspect any tcp traffic should see the increased memory usage eg http(s) transactions.
Updated by Thomas Winter 11 months ago
So far I have only seen this issue when using trex and another traffic generator.
Running multiple continuous curl to download large files, I do not see such large memory use.
A difference between the curl transactions is that they end with FIN/ACKs. The trex traffic ends in a RST/ACK.
I can use wget and kill the process to do a RST/ACK and see an increase in tcp.reassembly_memuse
while [ 1 ] do wget -o /dev/null -O /dev/null --http-user="user" --http-password="pass" --no-check-certificate http://192.168.10.15/upload_test & wget_pid=$! sleep 1 kill -s SIGTERM "$wget_pid" done
If wget is not interrupted, I do not see the memory use increase as drastically.
Updated by Thomas Winter 11 months ago
The streamingBuffer rework seems to be a red herring. I think it just increased memuse just enough to hit the memcap. There seems to be a large change in usage between 5.0.0 and 6.0.0 - I am performing another git bisect.
Updated by Thomas Winter 11 months ago
The cause was this patch c50ef8cc21 ("flow: fix TCP closed default initialization") which increased the timeout for TCP sessions in the closed state which caused flows to hang around longer causing elevated memuse which previously would have decreased much sooner. I'm not sure why I could only see this for sessions closed by RST and not FIN as both should become closed state.
Applying the following config restores the previous behaviour
flow-timeouts: tcp: closed: 0 emergency_closed: 0
Updated by Victor Julien 11 months ago
- Tracker changed from Bug to Documentation
- Subject changed from Large increase in memory after upgrade to 7.0.0 causing traffic loss due to memcap to doc: add tcp timeout fix to upgrade guide
- Assignee changed from Thomas Winter to Victor Julien
- Priority changed from High to Normal
- Target version changed from TBD to 7.0.3
Thanks Thomas. Switching this to a task to update the upgrading from 6 to 7 guide about this fix and how it can affect memory use.
Updated by Victor Julien 11 months ago
- Status changed from Assigned to In Review
- Label Needs backport to 6.0, Needs backport to 7.0 added
Updated by Shivani Bhardwaj 11 months ago
- Target version changed from 7.0.3 to 8.0.0-beta1
Updated by Thomas Winter 11 months ago
Should use "emergency-closed" not "emergency_closed" otherwise you get a deprecated warning log.
Updated by Jason Ish 2 months ago ยท Edited
- Status changed from In Review to Resolved
Fixed with PR https://github.com/OISF/suricata/pull/9995, commit 3456dea276c209b5bf0f95259a42f89d121ada32.
Updated by Philippe Antoine about 2 months ago
- Status changed from Resolved to Closed