Bug #3705
openVarNameStoreLookupById: Assertion `!(current == ((void *)0))'
Description
We're trying to use the multi-tenant support in Suricata. In our testsetup we were not having any problems, but trying to load a multi-tenant configuration (both device and vlan based) on a system that is monitoring a relatively busy link we get this error (often within a few minutes):
[May 18 14:48:38] 18/5/2020 -- 14:48:38 - <Notice> - rule reload complete
[May 18 14:48:38] 18/5/2020 -- 14:48:38 - <Notice> - Signature(s) loaded, Detect thread(s) activated.
[May 18 14:52:47] suricata: util-var-name.c:300: VarNameStoreLookupById: Assertion `!(current == ((void *)0))' failed.
It seems related to https://redmine.openinfosecfoundation.org/issues/2518 only we're not trying to perform a reload. We've tried both vlan and device based multi-tenant configurations. We tried setting "loaders" to 1 and 3. We tried "default" on and off. But this always happens.
As soon as we disable multi-select everything runs fine.
Updated by Justin Ossevoort over 4 years ago
I've run Suricata in gdb with a breakpoint on the `g_varnamestore_current_sc` global variable
(gdb) watch g_varnamestore_current_sc_atomic__ Hardware watchpoint 1: g_varnamestore_current_sc_atomic__ (gdb) run Starting program: /usr/bin/suricata --af-packet=mon3 [Thread debugging using libthread_db enabled] Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1". 19/5/2020 -- 09:00:37 - <Notice> - This is Suricata version 5.0.3 RELEASE running in SYSTEM mode [New Thread 0x7fffee069700 (LWP 683)] Thread 1 "Suricata-Main" hit Hardware watchpoint 1: g_varnamestore_current_sc_atomic__ Old value = (VarNameStore *) 0x0 New value = (VarNameStore *) 0x7fffe804f6c0 0x00005555557d05bf in VarNameStoreActivateStaging () at util-var-name.c:348 348 in util-var-name.c (gdb) c Continuing. [New Thread 0x7fffdcd1f700 (LWP 684)] [New Thread 0x7fffc7fff700 (LWP 685)] [New Thread 0x7fffc77fe700 (LWP 686)] [New Thread 0x7fffc6ffd700 (LWP 687)] [New Thread 0x7fffc67fc700 (LWP 688)] [New Thread 0x7fffc5ffb700 (LWP 689)] [New Thread 0x7fffc57fa700 (LWP 690)] [New Thread 0x7fffc4ff9700 (LWP 691)] [New Thread 0x7fff8ffff700 (LWP 692)] 19/5/2020 -- 09:02:44 - <Notice> - all 4 packet processing threads, 4 management threads initialized, engine started. 19/5/2020 -- 09:02:44 - <Notice> - rule reload starting Thread 1 "Suricata-Main" hit Hardware watchpoint 1: g_varnamestore_current_sc_atomic__ Old value = (VarNameStore *) 0x7fffe804f6c0 New value = (VarNameStore *) 0x0 0x00005555557d06c7 in VarNameStoreFree (de_ctx_version=99) at util-var-name.c:385 385 in util-var-name.c (gdb) bt #0 0x00005555557d06c7 in VarNameStoreFree (de_ctx_version=99) at util-var-name.c:385 #1 0x0000555555618649 in DetectEngineCtxFree (de_ctx=0x55555ecf7a40) at detect-engine.c:2148 #2 0x000055555562351b in DetectEngineCtxFree (de_ctx=<optimized out>) at detect-engine.c:4025 #3 DetectEnginePruneFreeList () at detect-engine.c:4025 #4 0x0000555555623951 in DetectEngineReload (suri=<optimized out>) at detect-engine.c:4117 #5 0x000055555557dd8a in PostRunStartedDetectSetup (suri=0x555555ad0000 <suricata>) at suricata.c:2600 #6 main (argc=<optimized out>, argv=<optimized out>) at suricata.c:3092 (gdb) c Continuing. 19/5/2020 -- 09:09:36 - <Notice> - rule reload complete 19/5/2020 -- 09:09:36 - <Notice> - Signature(s) loaded, Detect thread(s) activated. suricata: util-var-name.c:300: VarNameStoreLookupById: Assertion `!(current == ((void *)0))' failed. Thread 6 "W#04-mon3" received signal SIGABRT, Aborted. [Switching to Thread 0x7fffc6ffd700 (LWP 687)] 0x00007ffff6376781 in raise () from /lib/x86_64-linux-gnu/libc.so.6
Updated by Justin Ossevoort over 4 years ago
I'm not claiming any fundamental understanding of the codebase, but it seems the following is happening:
- Suricata starts up and instantiates a `DetectEngineCtx`, which in turn initializes the "VarNameStore"
- After initial startup it triggers a load reload (probably so that loading and reloading use the same codepaths)
- In `DetectEngineReload` a new `DetectEngineCtx` is created (which doesn't initialize a new "VarNameStore" as one is already present)
- It swaps the old and new context in `DetectEngineReloadThreads`
- Afterwards it triggers the cleanup of the old `DetectEngineCtx` using `DetectEngineCtxFree` in `DetectEnginePruneFreeList`, which in turn calls `VarNameStoreFree` which cleans up the current "VarNameStore"
Now we have new instances of the `DetectEngineCtx` running without a current "VarNameStore", which leads to us triggering the assertion.
My first guess would be that we need either a refcount for the "VarNameStore" to keep it alive across reloads or we need to perform a swap/handover during reload and a flag in the `DetectEngineCtx` that indicates if it owns the "VarNameStore".
Updated by Justin Ossevoort over 4 years ago
Justin Ossevoort wrote in #note-2:
My first guess would be that we need either a refcount for the "VarNameStore" to keep it alive across reloads or we need to perform a swap/handover during reload and a flag in the `DetectEngineCtx` that indicates if it owns the "VarNameStore".
Ah, it appears the part I was missing is the `de_ctx_version` logic, which acts effectively what I meant with the handover logic. It should only free the current "VarNameStore" if the version being freed matches the version being active.
This is possibly where the multi-tenant shenanigans come in.
Updated by Justin Ossevoort over 4 years ago
I've been running Suricata in gdb some more:
- Around reload the new `DetectEngineCtx` allocate a new "VarNameStore", which is put in the `g_varnamestore_staging` global variable
- When `VarNameStoreFree` (after the reload when cleaning up the old `DetectEngineCtx`) the new "VarNameStore" is still in the `g_varnamestore_staging` variable en the one in `g_varnamestore_current` matches the old `de_ctx_version` that is being cleaned up, so a NULL pointer is put in it's place.
Looking at the code I suspect in between these two lines someone should have called `VarNameStoreActivateStaging` (which puts `g_varnamestore_current` in `g_varnamestore_old` and `g_varnamestore_staging` in `g_varnamestore_current`).
This function is currently called in only two places:
- `detect-engine-build.c`: At the end of `SigGroupBuild` there is:
if (!DetectEngineMultiTenantEnabled()) { VarNameStoreActivateStaging(); }
- `detect-engine.c`: At the end of `DetectEngineMultiTenantSetup`
I suspect the first codepath is why the not-multi-tenant reload is working. The second callsite is only triggered at startup, but not around reloads.
I'm starting to think this is exactly https://redmine.openinfosecfoundation.org/issues/2518 as this problem is around reloads and I can't seem to find any logic that would handle multi-tenant reloading in the reload logic and consequently call the `VarNameStoreActivateStaging` for this case.
Updated by Justin Ossevoort over 4 years ago
Some further puzzle pieces:
- The `DetectEngineReload function appears to want only support non DETECT_ENGINE_MT_STUB type detect engines
- In `detect-engine.c` there is a function `DetectEngineMTApply` that contains parts of the logic from `DetectEngineReload` that also deal with `DETECT_ENGINE_MT_STUB` detect engine types, but this method appears to never be called in the normal reload logic, only around unix socket commands
Updated by Justin Ossevoort over 4 years ago
It appears I've zeroed in on the problem:
As before it appears that `DetectEngineReload` function doesn't support DETECT_ENGINE_MT_STUB, so why is it actually being called in the first place? And it should have short-circuited in the case of DETECT_ENGINE_MT_STUB, so why is it being called and runs to completion?
Well according to the breakpoint I set right before the check:
(gdb) print old_de_ctx->type $9 = DETECT_ENGINE_TYPE_DD_STUB
After a little more digging I found this is used to implement `detect.delayed-detect` (which is an option we've always had enabled). After disabling this option the entire `DetectEngineReload` is never called and Suricata starts succesfully.
But triggering a reload using the command channel will indeed trigger the same assertion so the `reload-rules` is indeed still broken (aka: https://redmine.openinfosecfoundation.org/issues/2518)
So the current state is:
- Bug: `multi-detect` can not be combined with `detect.delayed-detect` (leading to the earlier assertion in "VarNameStore"); It might be enough to declare this unsupported and reorder the check in `PostConfLoadedDetectSetup` so that `multi-detect` takes precedence over `detect.delayed-detect`
- Bug: `reload-rules` can not be used with `multi-detect` (again leading to the earlier assertion in "VarNameStore"); This probably needs a special case in `DetectEngineReload` and could possible use loop over the tenants and `DetectEngineMTApply` for activation.
- I feel there is still some strange corner case around `multi-detect` and `default: true` that should be handled, but I don't understand how that is supposed to work in the first place