Feature #2303 (closed)
file-store enhancements (aka file-store v2): deduplication; hash-based naming; JSON metadata and cleanup tooling
Description
At Suricon 2017, enhancements to the file-store were discussed that are better implemented as a file-store v2 than as additional options on the current file-store.
Deduplication
Log only one instance of each file. This can be done by using a hash-based naming scheme. SHA-256 will be used. When Suricata has determined that the file will be closed, it will be renamed to the SHA-256 of its content.
A directory scheme that uses the hash will be used. This will be a directory of 256 entries, 00 to ff and each file will be put into the directory that matches the first 2 characters of its SHA-256.
When renaming a file to its SHA-256, if it already exists it will simply be "touched" and the working copy deleted.
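The steps above (hash the content, pick the two-character subdirectory, then rename or touch) can be sketched in Python. This is only an illustration of the scheme, not Suricata's implementation: the function names store_file and sha256_of, the chunk size, and the assumption that the working copy lives on the same filesystem as the store are all hypothetical.

```python
import hashlib
import os
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the hex SHA-256 of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def store_file(working_copy: Path, store_root: Path) -> Path:
    """Move a closed working copy into the hash-named store (hypothetical helper)."""
    file_hash = sha256_of(working_copy)
    # Subdirectory is the first 2 hex characters of the hash: 00 to ff.
    target_dir = store_root / file_hash[:2]
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / file_hash
    if target.exists():
        # Duplicate content: refresh the timestamps ("touch") and drop the copy.
        os.utime(target, None)
        working_copy.unlink()
    else:
        # First occurrence: rename into place (assumes the same filesystem).
        os.replace(working_copy, target)
    return target
```

Storing two working copies with identical content returns the same path both times and leaves a single file in the store, which is the deduplication behaviour described above.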
Deduplication will not occur for the metadata files, as it's still useful to track each occurrence of a file. Perhaps the metadata files could follow a naming scheme like SHA256.<timestamp>.json?
This also removes the need for a waldo file.
Metadata as JSON
The metadata should be logged as JSON in a similar format to that of the fileinfo record.
Cleanup Tool
Introduce a core-supported tool for cleanup of extracted files that could be run interactively or via cron. Options should exist to delete files older than some duration, or to enforce a certain size on disk. Python could be used here as we have existing tooling in Python.
Related Tickets
https://redmine.openinfosecfoundation.org/issues/1201
https://redmine.openinfosecfoundation.org/issues/1949