Bug #6173
openhttp: loss of backward compatibility in HTTP logs from v6 to v7
Description
Resolution of #5320 done in https://github.com/OISF/suricata/pull/8714/commits is causing a set of problems.
First, it breaks backward compatibility of events and users will have to upgrade all data handling for something that was there since ages. I think this could have been avoid.
Second, switching from a set of JSON key: value
to a dictionary {"key": $key, "value": $value}
is going to cause severe issues in lot of data lake. For example, let's take Elasticsearch case. If default template is kept, we are going to have the following issue. On an event with 2 headers at least:
"key: "alpha", "value": "jane" "key": "beta", "value": "robert"
if user want to see if "alpha": "robert" is present then he will naturally do "key:alpha AND value:robert" and this will match on our event because of the way Elasticsearch handle array by default. If Elasticsearch is set up to use nested elements, the match should be correct. But I'm still completely missing how we can do a list of all values for a given header. Even more if we add the third point below.
Third, the key corresponding to header were "normalized" as it was a fixed string. This is not the case anymore as the key are not converted to lower case before being added to 'key'. The lack of conversion to lower case is making sense as this output is trying to match the reality of the transaction as close as possible.
Updated by Andreas Herz over 1 year ago
- Subject changed from Loss of backward compatibility in HTTP log to http: loss of backward compatibility in HTTP logs from v6 to v7
Updated by Victor Julien over 1 year ago
- Related to Bug #5320: Key collisions in HTTP JSON eve-logs added
Updated by Victor Julien over 1 year ago
- Description updated (diff)
For reference, the upgrade guide mentions that the format changed https://docs.suricata.io/en/suricata-7.0.0-rc2/upgrade.html#logging-changes
Updated by Philippe Antoine over 1 year ago
First, it breaks backward compatibility of events and users will have to upgrade all data handling for something that was there since ages.
So, how do you propose to fix #5320 ?
Having content_range
changed to content_range_parsed
in EveHttpLogJSONBasic
?
This is still a backward compatibility breakage (but maybe a lesser one)
Second, switching from a set of JSON key: value to a dictionary {"key": $key, "value": $value} is going to cause severe issues in lot of data lake.
Then, regit is not that a problem with @dump-all-headers
suricata.yaml option ?
(disclaimer: I do not understand everything about Elastic)
The lack of conversion to lower case is making sense as this output is trying to match the reality of the transaction as close as possible.
Not sure I understand if you want normalization or raw data here...
Updated by Eric Leblond over 1 year ago
Philippe Antoine wrote in #note-5:
First, it breaks backward compatibility of events and users will have to upgrade all data handling for something that was there since ages.
So, how do you propose to fix #5320 ?
Havingcontent_range
changed tocontent_range_parsed
inEveHttpLogJSONBasic
?
This is still a backward compatibility breakage (but maybe a lesser one)
Yes, changing one single key would be better. Also there is something we should avoid which is to have a JSON key pointing to different values type. Pointed in the issue was the fact one single key was single value or multiple values. In the case of Elasticsearch it handles it fine but for most "DB" engines or even for coding this is a nightmare.
Second, switching from a set of JSON key: value to a dictionary {"key": $key, "value": $value} is going to cause severe issues in lot of data lake.
Then,
regit is not that a problem with @dump-all-headers
suricata.yaml option ?
(disclaimer: I do not understand everything about Elastic)
For me, dump-all-headers is really nice when you want to investigate one single event or doing forensic as we are not going to miss a single field. The key, value is needed as adding dynamic keys to tools like Elasticsearch is causing an explosion of their engine. But it is not usable for doing queries like give me the rare $name_your_header_key values. I think we need to keep the set of static key we already have as it is really easy to query them and do stats or dashboards.
The lack of conversion to lower case is making sense as this output is trying to match the reality of the transaction as close as possible.
Not sure I understand if you want normalization or raw data here...
In dump all headers, I think it is making sense to have raw data. This is for me a one event at the time or something used in processing.
Updated by Jason Ish over 1 year ago
Can we get some examples of the actual breakage? I'm playing with 6.0.13 and 7.0.0-rc2 with some basic requests, but not too much is jumping out, at least with a default configuration.
I believe the main issue was that listing headers in custom
could overwrite some http
fields as we, arguably in error logged these objects directly into the http
object instead of some sub-object named headers. And now we log these response_headers
. regit am I correct that you see this breakage when using a custom value for @http.custom
?
As the header names allowed in custom
are a defined list, an alternative fix to the issue could be to make sure there is no name overlap by renaming some fields, but still logging them under the http
object.
Then response_headers
and request_headers
could be saved for the dump-all-headers
option.
Within response_headers
and request_headers
what is the best format? What we have now:
"response_headers": [ { "name": "Server", value: "Caddy" } ]
or..
"response_headers": { "Server": ["Caddy"], "Header With Space": ["Testing"] }
where the first may be harder to query, but second causing index expansion and possibly other issues.
Updated by Eric Leblond over 1 year ago
Jason Ish wrote in #note-7:
Can we get some examples of the actual breakage? I'm playing with 6.0.13 and 7.0.0-rc2 with some basic requests, but not too much is jumping out, at least with a default configuration.
So I was trying to get the http server information in the log and I used the configuration to do so:
- http: extended: yes # enable this for extended logging information custom: [server, accept-encoding]
the thing I have with 7 is:
{ "hostname": "198.71.247.91", "url": "/", "http_user_agent": "curl/7.58.0", "http_content_type": "text/html", "http_method": "GET", "protocol": "HTTP/1.1", "status": 200, "length": 51, "response_headers": [ { "name": "Server", "value": "Apache/2.4.41 (Ubuntu)" } ] }
With 6.x I got
{ "hostname": "198.71.247.91", "url": "/", "http_user_agent": "curl/7.58.0", "http_content_type": "text/html", "server": "Apache/2.4.41 (Ubuntu)", "http_method": "GET", "protocol": "HTTP/1.1", "status": 200, "length": 51 }
{ "name": "Server", value: "Caddy" }I believe the main issue was that listing headers in
custom
could overwrite somehttp
fields as we, arguably in error logged these objects directly into thehttp
object instead of some sub-object named headers. And now we log theseresponse_headers
.regit am I correct that you see this breakage when using a custom value for @http.custom
?As the header names allowed in
custom
are a defined list, an alternative fix to the issue could be to make sure there is no name overlap by renaming some fields, but still logging them under thehttp
object.Then
response_headers
andrequest_headers
could be saved for thedump-all-headers
option.Within
response_headers
andrequest_headers
what is the best format? What we have now:
"response_headers": [
]
IMO, this is the way to go for listing of arbitrary keys.
or..
[...]
where the first may be harder to query, but second causing index expansion and possibly other issues.
Yes, this one is going to cause software like Elasticsearch to explode.
Eric Leblond wrote in #note-6:
Philippe Antoine wrote in #note-5:
First, it breaks backward compatibility of events and users will have to upgrade all data handling for something that was there since ages.
So, how do you propose to fix #5320 ?
Havingcontent_range
changed tocontent_range_parsed
inEveHttpLogJSONBasic
?
This is still a backward compatibility breakage (but maybe a lesser one)Yes, changing one single key would be better. Also there is something we should avoid which is to have a JSON key pointing to different values type. Pointed in the issue was the fact one single key was single value or multiple values. In the case of Elasticsearch it handles it fine but for most "DB" engines or even for coding this is a nightmare.
Second, switching from a set of JSON key: value to a dictionary {"key": $key, "value": $value} is going to cause severe issues in lot of data lake.
Then,
regit is not that a problem with @dump-all-headers
suricata.yaml option ?
(disclaimer: I do not understand everything about Elastic)For me, dump-all-headers is really nice when you want to investigate one single event or doing forensic as we are not going to miss a single field. The key, value is needed as adding dynamic keys to tools like Elasticsearch is causing an explosion of their engine. But it is not usable for doing queries like give me the rare $name_your_header_key values. I think we need to keep the set of static key we already have as it is really easy to query them and do stats or dashboards.
The lack of conversion to lower case is making sense as this output is trying to match the reality of the transaction as close as possible.
Not sure I understand if you want normalization or raw data here...
In dump all headers, I think it is making sense to have raw data. This is for me a one event at the time or something used in processing.
Updated by Jason Ish over 1 year ago
Any recommendations on fixing this breaking change or comments on my suggested fix?
Updated by Eric Leblond over 1 year ago
For content_range, I would rename the array version in content_ranges like it is done for ethernet address but proposed content_range_parsed works too.
Updated by Jason Ish over 1 year ago
This PR attempts to provide more notes about the breaking change for upgrades: https://github.com/OISF/suricata/pull/9166