Project

General

Profile

Actions

Bug #6173

open

http: loss of backward compatibility in HTTP logs from v6 to v7

Added by Eric Leblond over 1 year ago. Updated over 1 year ago.

Status:
New
Priority:
Normal
Assignee:
Target version:
Affected Versions:
Effort:
Difficulty:
Label:

Description

Resolution of #5320 done in https://github.com/OISF/suricata/pull/8714/commits is causing a set of problems.

First, it breaks backward compatibility of events and users will have to upgrade all data handling for something that was there since ages. I think this could have been avoid.

Second, switching from a set of JSON key: value to a dictionary {"key": $key, "value": $value} is going to cause severe issues in lot of data lake. For example, let's take Elasticsearch case. If default template is kept, we are going to have the following issue. On an event with 2 headers at least:

"key: "alpha", "value": "jane" 
"key": "beta", "value": "robert" 

if user want to see if "alpha": "robert" is present then he will naturally do "key:alpha AND value:robert" and this will match on our event because of the way Elasticsearch handle array by default. If Elasticsearch is set up to use nested elements, the match should be correct. But I'm still completely missing how we can do a list of all values for a given header. Even more if we add the third point below.

Third, the key corresponding to header were "normalized" as it was a fixed string. This is not the case anymore as the key are not converted to lower case before being added to 'key'. The lack of conversion to lower case is making sense as this output is trying to match the reality of the transaction as close as possible.


Related issues 1 (0 open1 closed)

Related to Suricata - Bug #5320: Key collisions in HTTP JSON eve-logsClosedPhilippe AntoineActions
Actions #1

Updated by Andreas Herz over 1 year ago

  • Affected Versions 7.0.0-rc2 added
Actions #2

Updated by Andreas Herz over 1 year ago

  • Subject changed from Loss of backward compatibility in HTTP log to http: loss of backward compatibility in HTTP logs from v6 to v7
Actions #3

Updated by Victor Julien over 1 year ago

  • Related to Bug #5320: Key collisions in HTTP JSON eve-logs added
Actions #4

Updated by Victor Julien over 1 year ago

  • Description updated (diff)

For reference, the upgrade guide mentions that the format changed https://docs.suricata.io/en/suricata-7.0.0-rc2/upgrade.html#logging-changes

Actions #5

Updated by Philippe Antoine over 1 year ago

First, it breaks backward compatibility of events and users will have to upgrade all data handling for something that was there since ages.

So, how do you propose to fix #5320 ?
Having content_range changed to content_range_parsed in EveHttpLogJSONBasic ?
This is still a backward compatibility breakage (but maybe a lesser one)

Second, switching from a set of JSON key: value to a dictionary {"key": $key, "value": $value} is going to cause severe issues in lot of data lake.

Then, regit is not that a problem with @dump-all-headers suricata.yaml option ?
(disclaimer: I do not understand everything about Elastic)

The lack of conversion to lower case is making sense as this output is trying to match the reality of the transaction as close as possible.

Not sure I understand if you want normalization or raw data here...

Actions #6

Updated by Eric Leblond over 1 year ago

Philippe Antoine wrote in #note-5:

First, it breaks backward compatibility of events and users will have to upgrade all data handling for something that was there since ages.

So, how do you propose to fix #5320 ?
Having content_range changed to content_range_parsed in EveHttpLogJSONBasic ?
This is still a backward compatibility breakage (but maybe a lesser one)

Yes, changing one single key would be better. Also there is something we should avoid which is to have a JSON key pointing to different values type. Pointed in the issue was the fact one single key was single value or multiple values. In the case of Elasticsearch it handles it fine but for most "DB" engines or even for coding this is a nightmare.

Second, switching from a set of JSON key: value to a dictionary {"key": $key, "value": $value} is going to cause severe issues in lot of data lake.

Then, regit is not that a problem with @dump-all-headers suricata.yaml option ?
(disclaimer: I do not understand everything about Elastic)

For me, dump-all-headers is really nice when you want to investigate one single event or doing forensic as we are not going to miss a single field. The key, value is needed as adding dynamic keys to tools like Elasticsearch is causing an explosion of their engine. But it is not usable for doing queries like give me the rare $name_your_header_key values. I think we need to keep the set of static key we already have as it is really easy to query them and do stats or dashboards.

The lack of conversion to lower case is making sense as this output is trying to match the reality of the transaction as close as possible.

Not sure I understand if you want normalization or raw data here...

In dump all headers, I think it is making sense to have raw data. This is for me a one event at the time or something used in processing.

Actions #7

Updated by Jason Ish over 1 year ago

Can we get some examples of the actual breakage? I'm playing with 6.0.13 and 7.0.0-rc2 with some basic requests, but not too much is jumping out, at least with a default configuration.

I believe the main issue was that listing headers in custom could overwrite some http fields as we, arguably in error logged these objects directly into the http object instead of some sub-object named headers. And now we log these response_headers. regit am I correct that you see this breakage when using a custom value for @http.custom?

As the header names allowed in custom are a defined list, an alternative fix to the issue could be to make sure there is no name overlap by renaming some fields, but still logging them under the http object.

Then response_headers and request_headers could be saved for the dump-all-headers option.

Within response_headers and request_headers what is the best format? What we have now:

"response_headers": [
  { "name": "Server", value: "Caddy" }
]

or..

"response_headers": {
  "Server": ["Caddy"],
  "Header With Space": ["Testing"]
}

where the first may be harder to query, but second causing index expansion and possibly other issues.

Actions #8

Updated by Eric Leblond over 1 year ago

Jason Ish wrote in #note-7:

Can we get some examples of the actual breakage? I'm playing with 6.0.13 and 7.0.0-rc2 with some basic requests, but not too much is jumping out, at least with a default configuration.

So I was trying to get the http server information in the log and I used the configuration to do so:

        - http:
            extended: yes     # enable this for extended logging information
            custom: [server, accept-encoding]

the thing I have with 7 is:

{
  "hostname": "198.71.247.91",
  "url": "/",
  "http_user_agent": "curl/7.58.0",
  "http_content_type": "text/html",
  "http_method": "GET",
  "protocol": "HTTP/1.1",
  "status": 200,
  "length": 51,
  "response_headers": [
    {
      "name": "Server",
      "value": "Apache/2.4.41 (Ubuntu)" 
    }
  ]
}

With 6.x I got

{
  "hostname": "198.71.247.91",
  "url": "/",
  "http_user_agent": "curl/7.58.0",
  "http_content_type": "text/html",
  "server": "Apache/2.4.41 (Ubuntu)",
  "http_method": "GET",
  "protocol": "HTTP/1.1",
  "status": 200,
  "length": 51
}

I believe the main issue was that listing headers in custom could overwrite some http fields as we, arguably in error logged these objects directly into the http object instead of some sub-object named headers. And now we log these response_headers. regit am I correct that you see this breakage when using a custom value for @http.custom?

As the header names allowed in custom are a defined list, an alternative fix to the issue could be to make sure there is no name overlap by renaming some fields, but still logging them under the http object.

Then response_headers and request_headers could be saved for the dump-all-headers option.

Within response_headers and request_headers what is the best format? What we have now:
"response_headers": [

{ "name": "Server", value: "Caddy" }
]

IMO, this is the way to go for listing of arbitrary keys.

or..

[...]

where the first may be harder to query, but second causing index expansion and possibly other issues.

Yes, this one is going to cause software like Elasticsearch to explode.

Eric Leblond wrote in #note-6:

Philippe Antoine wrote in #note-5:

First, it breaks backward compatibility of events and users will have to upgrade all data handling for something that was there since ages.

So, how do you propose to fix #5320 ?
Having content_range changed to content_range_parsed in EveHttpLogJSONBasic ?
This is still a backward compatibility breakage (but maybe a lesser one)

Yes, changing one single key would be better. Also there is something we should avoid which is to have a JSON key pointing to different values type. Pointed in the issue was the fact one single key was single value or multiple values. In the case of Elasticsearch it handles it fine but for most "DB" engines or even for coding this is a nightmare.

Second, switching from a set of JSON key: value to a dictionary {"key": $key, "value": $value} is going to cause severe issues in lot of data lake.

Then, regit is not that a problem with @dump-all-headers suricata.yaml option ?
(disclaimer: I do not understand everything about Elastic)

For me, dump-all-headers is really nice when you want to investigate one single event or doing forensic as we are not going to miss a single field. The key, value is needed as adding dynamic keys to tools like Elasticsearch is causing an explosion of their engine. But it is not usable for doing queries like give me the rare $name_your_header_key values. I think we need to keep the set of static key we already have as it is really easy to query them and do stats or dashboards.

The lack of conversion to lower case is making sense as this output is trying to match the reality of the transaction as close as possible.

Not sure I understand if you want normalization or raw data here...

In dump all headers, I think it is making sense to have raw data. This is for me a one event at the time or something used in processing.

Actions #9

Updated by Jason Ish over 1 year ago

Any recommendations on fixing this breaking change or comments on my suggested fix?

Actions #10

Updated by Eric Leblond over 1 year ago

For content_range, I would rename the array version in content_ranges like it is done for ethernet address but proposed content_range_parsed works too.

Actions #11

Updated by Jason Ish over 1 year ago

This PR attempts to provide more notes about the breaking change for upgrades: https://github.com/OISF/suricata/pull/9166

Actions

Also available in: Atom PDF