Bug #3342
closedSuricata 5.0 crashes while parsing SMB data
Description
From a couple of days Suricata 5.0 has been crashing on a regular basis, around the same time. We finally got the core and some backtrace. Let me know what else I can do to help debug it.
Backtrace
(gdb) bt #0 __GI___libc_free (mem=0x7fe28bffa010) at malloc.c:3104 #1 0x00005613ebd127cd in StreamingBufferClear (sb=0x7fe5fe87c420) at util-streaming-buffer.c:392 #2 StreamingBufferClear (sb=0x7fe5fe87c420) at util-streaming-buffer.c:132 #3 0x00005613ebd127e7 in StreamingBufferFree (sb=0x7fe5fe87c420) at util-streaming-buffer.c:148 #4 0x00005613ebcd4ec4 in FileFree (ff=0x7fe5febbcc50) at util-file.c:482 #5 0x00005613ebcd643c in FileContainerRecycle (ffc=0x7fe5fe67d718) at util-file.c:406 #6 0x00005613ebd7ea1c in suricata::filecontainer::FileContainer::free (self=0x7fe5fe67d718) at src/filecontainer.rs:41 #7 suricata::smb::files::SMBFiles::free (self=<optimized out>) at src/smb/files.rs:67 #8 suricata::smb::smb::SMBState::free (self=<optimized out>) at src/smb/smb.rs:838 #9 rs_smb_state_free (state=<optimized out>) at src/smb/smb.rs:1791 #10 0x00005613ebb2ad87 in AppLayerParserStateCleanup (f=f@entry=0x7fe57214e2f0, alstate=<optimized out>, pstate=0x7fe5fe61bbf0) at app-layer-parser.c:1414 #11 0x00005613ebc06983 in FlowCleanupAppLayer (f=0x7fe57214e2f0) at flow.c:130 #12 FlowCleanupAppLayer (f=0x7fe57214e2f0) at flow.c:125 #13 0x00005613ebc08999 in FlowClearMemory (f=f@entry=0x7fe57214e2f0, proto_map=<optimized out>) at flow.c:1041 #14 0x00005613ebc0bafc in FlowRecycler (th_v=0x561401737c90, thread_data=0x7fe564000b60) at flow-manager.c:997 #15 0x00005613ebcaac0c in TmThreadsManagement (td=0x561401737c90) at tm-threads.c:706 #16 0x00007fe62a87987f in start_thread (arg=<optimized out>) at pthread_create.c:479 #17 0x00007fe629d35e03 in clone () at arena.c:290
Went up to find the flow, got the source and destination IP addresses that tell me that's some kind of SMB flow between the domain controller and servers.
(gdb) frame 13 #13 0x00005613ebc08999 in FlowClearMemory (f=f@entry=0x7fe57214e2f0, proto_map=<optimized out>) at flow.c:1041 1041 flow.c: No such file or directory. (gdb) print *f $1 = {src = {address = {address_un_data32 = {489173258, 0, 0, 0}, address_un_data16 = {12554, 7464, 0, 0, 0, 0, 0, 0}, address_un_data8 = "\n1(\035", '\000' <repeats 11 times>}}, dst = {address = {address_un_data32 = {1749364746, 0, 0, 0}, address_un_data16 = {12298, 26693, 0, 0, 0, 0, 0, 0}, address_un_data8 = "\n0Eh", '\000' <repeats 11 times>}}, {sp = 49668, icmp_s = {type = 4 '\004', code = 194 '\302'}}, {dp = 445, icmp_d = {type = 189 '\275', code = 1 '\001'}}, proto = 6 '\006', recursion_level = 0 '\000', vlan_id = {240, 0}, vlan_idx = 1 '\001', livedev = 0x5613ed64e730, flow_hash = 3288764968, lastts = {tv_sec = 1574112362, tv_usec = 320281}, flow_state_sc_atomic__ = 2, use_cnt_sc_atomic__ = 0, tenant_id = 0, probing_parser_toserver_alproto_masks = 0, probing_parser_toclient_alproto_masks = 0, flags = 1651483, file_flags = 4092, protodetect_dp = 0, parent_id = 0, m = { __data = {__lock = 1, __count = 0, __owner = 187548, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001\000\000\000\000\000\000\000\234\334\002\000\001", '\000' <repeats 26 times>, __align = 1}, protoctx = 0x7fe5fd8b5b70, protomap = 0 '\000', flow_end_flags = 20 '\024', alproto = 8, alproto_ts = 8, alproto_tc = 8, alproto_orig = 0, alproto_expect = 0, de_ctx_version = 104, thread_id = {2, 2}, min_ttl_toserver = 128 '\200', max_ttl_toserver = 128 '\200', min_ttl_toclient = 64 '@', max_ttl_toclient = 127 '\177', alparser = 0x7fe5fe61bbf0, alstate = 0x7fe5fe67d5c0, sgh_toclient = 0x5613fbe2ddf0, sgh_toserver = 0x5613f3e36460, flowvar = 0x0, hnext = 0x0, hprev = 0x0, fb = 0x7fe625372a40, lnext = 0x0, lprev = 0x0, startts = {tv_sec = 1574111808, tv_usec = 91686}, todstpktcnt = 397335, tosrcpktcnt = 2535584, todstbytecnt = 27513783, tosrcbytecnt = 3818073044}
(gdb) frame 10 #10 0x00005613ebb2ad87 in AppLayerParserStateCleanup (f=f@entry=0x7fe57214e2f0, alstate=<optimized out>, pstate=0x7fe5fe61bbf0) at app-layer-parser.c:1414 1414 app-layer-parser.c: No such file or directory. (gdb) print *f $2 = {src = {address = {address_un_data32 = {489173258, 0, 0, 0}, address_un_data16 = {12554, 7464, 0, 0, 0, 0, 0, 0}, address_un_data8 = "\n1(\035", '\000' <repeats 11 times>}}, dst = {address = {address_un_data32 = {1749364746, 0, 0, 0}, address_un_data16 = {12298, 26693, 0, 0, 0, 0, 0, 0}, address_un_data8 = "\n0Eh", '\000' <repeats 11 times>}}, {sp = 49668, icmp_s = {type = 4 '\004', code = 194 '\302'}}, {dp = 445, icmp_d = {type = 189 '\275', code = 1 '\001'}}, proto = 6 '\006', recursion_level = 0 '\000', vlan_id = {240, 0}, vlan_idx = 1 '\001', livedev = 0x5613ed64e730, flow_hash = 3288764968, lastts = {tv_sec = 1574112362, tv_usec = 320281}, flow_state_sc_atomic__ = 2, use_cnt_sc_atomic__ = 0, tenant_id = 0, probing_parser_toserver_alproto_masks = 0, probing_parser_toclient_alproto_masks = 0, flags = 1651483, file_flags = 4092, protodetect_dp = 0, parent_id = 0, m = { __data = {__lock = 1, __count = 0, __owner = 187548, __nusers = 1, __kind = 0, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\001\000\000\000\000\000\000\000\234\334\002\000\001", '\000' <repeats 26 times>, __align = 1}, protoctx = 0x7fe5fd8b5b70, protomap = 0 '\000', flow_end_flags = 20 '\024', alproto = 8, alproto_ts = 8, alproto_tc = 8, alproto_orig = 0, alproto_expect = 0, de_ctx_version = 104, thread_id = {2, 2}, min_ttl_toserver = 128 '\200', max_ttl_toserver = 128 '\200', min_ttl_toclient = 64 '@', max_ttl_toclient = 127 '\177', alparser = 0x7fe5fe61bbf0, alstate = 0x7fe5fe67d5c0, sgh_toclient = 0x5613fbe2ddf0, sgh_toserver = 0x5613f3e36460, flowvar = 0x0, hnext = 0x0, hprev = 0x0, fb = 0x7fe625372a40, lnext = 0x0, lprev = 0x0, startts = {tv_sec = 1574111808, tv_usec = 91686}, todstpktcnt = 397335, tosrcpktcnt = 2535584, todstbytecnt = 27513783, tosrcbytecnt = 3818073044}
src 10.49.40.29 -> dst 10.48.69.104 so a connection was initiated from a server to the AD controller.
Updated by Michal Purzynski almost 5 years ago
What I see around this time (with Zeek's flow logs and parsed SMB data) is
- there's a couple of connections between 150 - 500 seconds
- two connections over 500 seconds have a very high bytes count - over 3.4GB transferred
Connections are multiplexed - there's a long-running TCP connection with multiple SMB operations over it. I can count at least 144 operations.
Out of these 144 SMB operations per one TCP connection
- most are in the range of tens of KBs to 200KBs
- there's at least one that literally transfers entire Windows image - 3669480484 bytes of it
"Operating Systems\Windows 10 Enterprise x64\Sources\install.wim"
Zeek has no problem tracing that. My guess is that there's some problem with the memory management in conditions like these.
Updated by Victor Julien almost 5 years ago
- Status changed from New to Assigned
- Assignee set to Victor Julien
Thanks Michal. This crash happens during cleanup after the flow has timed out. I strongly suspect the cause of this is elsewhere in the form of some memory corruption. Are you able to rebuild suricata with ASAN enabled and run that for a while?
Updated by Michal Purzynski almost 5 years ago
Victor Julien wrote:
Thanks Michal. This crash happens during cleanup after the flow has timed out. I strongly suspect the cause of this is elsewhere in the form of some memory corruption. Are you able to rebuild suricata with ASAN enabled and run that for a while?
Here's the crash message, looks like a segfault. Double free?
[1363579.695283] FR#01187548: segfault at 7fe28bffa008 ip 00007fe629cab745 sp 00007fe5acfe87d0 error 4 cpu 0 in libc-2.30.so[7fe629c2e000+188000]
[1363579.695298] Code: 00 0f 1f 40 00 f3 0f 1e fa 48 83 ec 18 48 8b 05 b9 77 15 00 48 8b 00 48 85 c0 0f 85 8d 00 00 00 0f 1f 44 00 00 48 85 ff 74 7b <48> 8b 47 f8 48 8d 77 f0 a8 02 75 3f 2e 2e 2e 48 8b 15 0d 76 15 00
Glibc's sources tell me we're crashing in
void
__libc_free (void mem)
(...)
if (chunk_is_mmapped (p)) / release mmapped memory. */
I'll rebuild Suri with ASAN.
Updated by Michal Purzynski almost 5 years ago
After 30+ hours Suricata crashed again, here's what ASAN reported (not much, to be honest)
AddressSanitizer:DEADLYSIGNAL ================================================================= ==258009==ERROR: AddressSanitizer: SEGV on unknown address 0x7ff3dfee07f0 (pc 0x7ff813bf8dd6 bp 0x7ff3dfee07f0 sp 0x7ff6d5afed40 T12) ==258009==The signal is caused by a WRITE memory access. #0 0x7ff813bf8dd5 (/usr/lib64/libasan.so.5+0x2bdd5) #1 0x7ff813ce207c in free (/usr/lib64/libasan.so.5+0x11507c) #2 0x55ace165835d in StreamingBufferClear /home/clear/tmp/suricata/src/util-streaming-buffer.c:139 #3 0x55ace1658381 in StreamingBufferFree /home/clear/tmp/suricata/src/util-streaming-buffer.c:148 #4 0x55ace15d52a1 in FileFree /home/clear/tmp/suricata/src/util-file.c:482 #5 0x55ace15d7d8d in FileContainerRecycle /home/clear/tmp/suricata/src/util-file.c:406 #6 0x55ace170711b in suricata::filecontainer::FileContainer::free::ha1ce6bb9d240ac75 src/filecontainer.rs:41 #7 0x55ace170711b in suricata::smb::files::SMBFiles::free::hc5b12d3b0c895ccf src/smb/files.rs:67 #8 0x55ace170711b in suricata::smb::smb::SMBState::free::had064a84c059df8f src/smb/smb.rs:838 #9 0x55ace170711b in rs_smb_state_free src/smb/smb.rs:1791 #10 0x55ace1220603 in AppLayerParserStateCleanup /home/clear/tmp/suricata/src/app-layer-parser.c:1414 #11 0x55ace14056f7 in FlowCleanupAppLayer /home/clear/tmp/suricata/src/flow.c:130 #12 0x55ace14094f6 in FlowClearMemory /home/clear/tmp/suricata/src/flow.c:1041 #13 0x55ace1414e1c in FlowRecycler /home/clear/tmp/suricata/src/flow-manager.c:997 #14 0x55ace15896d5 in TmThreadsManagement /home/clear/tmp/suricata/src/tm-threads.c:706 #15 0x7ff8139fe87e (/usr/lib64/libpthread.so.0+0x987e) #16 0x7ff81359ae02 in clone (/usr/lib64/haswell/libc.so.6+0x12ce02) AddressSanitizer can not provide additional info. SUMMARY: AddressSanitizer: SEGV (/usr/lib64/libasan.so.5+0x2bdd5) Thread T12 (FR#01) created by T0 (Suricata-Main) here: #0 0x7ff813c0caaf in __interceptor_pthread_create (/usr/lib64/libasan.so.5+0x3faaf) #1 0x55ace158bf06 in TmThreadSpawn /home/clear/tmp/suricata/src/tm-threads.c:1868 #2 0x55ace14187d1 in FlowRecyclerThreadSpawn /home/clear/tmp/suricata/src/flow-manager.c:1076 #3 0x55ace14f897f in RunModeDispatch /home/clear/tmp/suricata/src/runmodes.c:388 #4 0x55ace157a49d in main /home/clear/tmp/suricata/src/suricata.c:3065 #5 0x7ff8134954a1 in __libc_start_main (/usr/lib64/haswell/libc.so.6+0x274a1) ==258009==ABORTING
Updated by Victor Julien almost 5 years ago
- Target version changed from 5.0.0 to 5.0.1
- Affected Versions 5.0.0 added
Updated by Victor Julien almost 5 years ago
Hi Michal, are you capturing the traffic so you can see if you can reproduce it on the pcap?
Updated by Victor Julien almost 5 years ago
- Target version changed from 5.0.1 to 5.0.2
Updated by Victor Julien over 4 years ago
Michal, are you still seeing this with the current master or master-5.0.x branch?
Updated by Victor Julien over 4 years ago
- Target version changed from 5.0.2 to 5.0.3
Updated by Michal Purzynski over 4 years ago
Victor Julien wrote:
Michal, are you still seeing this with the current master or master-5.0.x branch?
I can test on -current next week.
So far I just filtered out that traffic that was causing crashes and haven't seen anything bad since.
Updated by Michal Purzynski over 4 years ago
Michal Purzynski wrote in #note-1:
What I see around this time (with Zeek's flow logs and parsed SMB data) is
- there's a couple of connections between 150 - 500 seconds
- two connections over 500 seconds have a very high bytes count - over 3.4GB transferredConnections are multiplexed - there's a long-running TCP connection with multiple SMB operations over it. I can count at least 144 operations.
Out of these 144 SMB operations per one TCP connection
- most are in the range of tens of KBs to 200KBs
- there's at least one that literally transfers entire Windows image - 3669480484 bytes of it
"Operating Systems\Windows 10 Enterprise x64\Sources\install.wim"Zeek has no problem tracing that. My guess is that there's some problem with the memory management in conditions like these.
Michal Purzynski wrote in #note-11:
Victor Julien wrote:
Michal, are you still seeing this with the current master or master-5.0.x branch?
I can test on -current next week.
So far I just filtered out that traffic that was causing crashes and haven't seen anything bad since.
Just deployed Suricata-current (today's git) and am running it with no filters, should have an update within 48h
Suricata 6.0.0-dev (c5cee0516 2020-03-11)
suricata 1491903 1 99 01:13 pts/0 00:03:21 /opt/suricata-current/bin/suricata -vvv -c /etc/suricata/suricata.yaml.local.debug --af-packet
suricata 1491904 1 99 01:13 pts/0 00:04:45 /opt/suricata-current/bin/suricata -vvv -c /etc/suricata/suricata.yaml.remote.debug --af-packet
Updated by Victor Julien over 4 years ago
Hi Michal, are you able provide an update on this?
Updated by Michal Purzynski over 4 years ago
Terrific news - Suricata master has been stable for the past 4+ weeks. It might be I cannot reproduce it anymore.
Updated by Victor Julien over 4 years ago
- Status changed from Assigned to Closed
- Priority changed from Urgent to Normal
Great, thanks for update Michal!