
Datajson v1.0 #12102

Open · wants to merge 8 commits into master
Conversation

@regit (Contributor) commented Nov 7, 2024

Indicators of Compromise (IOCs) are a key element in Security Operations Centers. Datasets
have been a huge step toward getting alerts on IOCs from Suricata, but the produced alerts
lack contextualization. For example, an IOC management software may have a list of host
names linked to different threat actors, but at ingestion in Suricata they become
a simple list of strings. This list will be used via a rule like

alert tls any any -> any any (msg:"IOC hostname on TLS"; tls.sni; dataset:isset,hostname.lst,...; sid:1664;)

With this, an alert will carry a subject without information, and a mapping will have to be done
a posteriori to see which IOC has hit. The pseudo algorithm to run is:

  • Intercept the signature 1664
  • Extract the tls.sni
  • Check the tls.sni value in the IOC management software
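The steps above can be sketched as follows (Python for illustration only; `enrich` and the `ioc_db` stand-in for the IOC management software are hypothetical names, not part of Suricata):

```python
import json

# Hypothetical stand-in for the IOC management software's database.
ioc_db = {"evil.example.com": {"actor": "APT28"}}

def enrich(eve_line: str) -> dict:
    """Post-process one EVE JSON line: if sid 1664 fired, look up tls.sni."""
    event = json.loads(eve_line)
    if event.get("alert", {}).get("signature_id") == 1664:
        sni = event.get("tls", {}).get("sni")
        if sni in ioc_db:
            event["ioc_context"] = ioc_db[sni]  # the a-posteriori mapping step
    return event
```

This external correlation step is exactly what datajson aims to make unnecessary.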

This works, but it is not optimal, as correlation and external processing have to be done for all the matches.

To fix this issue, we need to be able to ingest the IOCs without losing the contextual
information contained in the IOC management software.

Datajson is a proposed implementation that addresses this issue. Instead of injecting into
Suricata a plain value list, we can attach to each value a JSON object that will end up in
the alert output.

The following example alerts on source and destination IPs found in datasets:

alert tls $HOME_NET any -> any any (msg:"Test dataset";
                              ip.src; datajson:isset,ip4list,type ipv6,load ip4-json.lst,key inventory; \
                              ip.dst; datajson:isset,actors_ip,type ipv6, load bad-json.lst, key bad_actors; sid:1;)

In ip4-json.lst, we have data from inventory:

10.7.5.5,{"user":"vjulien","rank": 1}

In bad-json.lst, we have data from the IOC management software:

185.117.73.76,["Bad Panda"]
144.217.50.240,["killer bear","LSD kitten"]

The result is an alert section that looks like:

  "alert": {
    "action": "allowed",
    "gid": 1,
    "signature_id": 1,
    "rev": 0,
    "signature": "Test dataset",
    "category": "",
    "severity": 3,
    "extra": {
      "inventory": {
        "user": "vjulien",
        "rank": 1
      },
      "bad_actors": [
"killer bear",
        "LSD kitten"
      ]
    }
  },


Link to ticket: https://redmine.openinfosecfoundation.org/issues/7372

Describe changes:

  • Implement datajson keywords for all dataset types
  • Add documentation
  • Add unix socket commands

SV_REPO=https://github.com/regit/suricata-verify/
SV_BRANCH=datajson-v1.0

This patch introduces a new keyword, datajson, that is similar
to dataset with a twist. Where dataset allows matching from sets,
datajson allows the same but also adds JSON data to the alert
event. This data comes from the set definition itself.
For example, an ipv4 set will look like:

  10.16.1.11,{"test": "success","context":3}

The syntax is the value and the JSON data separated by a comma.

The syntax of the keyword is the following:

  datajson:isset,src_ip,type ip,load src.lst,key src_ip;

Compared to dataset, it just has a supplementary option, key,
that is used to indicate in which subobject the JSON value
should be added.

The information is added in the event under the alert.extra
subobject:

  "alert": {
    "extra": {
      "src_ip": {
        "test": "success",
        "context": 3
      }
    }
  }
The main interest of the feature is to be able to contextualize
a match. For example, if you have an IOC source, you can do:

 value1,{"actor":"APT28","Country":"FR"}
 value2,{"actor":"APT32","Country":"NL"}

This way, a single dataset is able to add context to the
event, where previously it was not possible and multiple signatures
had to be used.

Ticket: OISF#7372
The previous code was using an array, introducing a limit on the
number of datajson keywords that can be used in a signature.

This patch uses a linked list instead to overcome the limit. By
using a first element of the list that is part of the structure,
we limit the cost of the feature to one structure member added to
the PacketAlert structure. Only the PacketAlertFree function is
impacted, as we need to iterate to find potential allocations.

Ticket: OISF#7372
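A minimal sketch of this idea, in Python for illustration (the names mirror, but are not, Suricata's C structures): the first node is embedded in the alert structure, so the common single-keyword case needs no allocation, and only additional keywords chain new nodes.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class JsonNode:
    json: Optional[str] = None
    next: "Optional[JsonNode]" = None

@dataclass
class PacketAlertSketch:
    sid: int
    first: JsonNode = field(default_factory=JsonNode)  # embedded first element

    def append_json(self, raw_json: str) -> None:
        node = self.first
        if node.json is None:
            node.json = raw_json             # free embedded slot: no allocation
            return
        while node.next is not None:
            node = node.next
        node.next = JsonNode(json=raw_json)  # only extra keywords allocate

    def all_json(self):
        # the iteration PacketAlertFree would do to find potential allocations
        out, node = [], self.first
        while node is not None:
            if node.json is not None:
                out.append(node.json)
            node = node.next
        return out
```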
It was not correctly handling JSON values containing spaces, as they
were seen as multiple arguments.

Ticket: OISF#7372
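The failure mode and fix can be sketched as follows, assuming a hypothetical datajson-add unix socket command (the real command name and shape may differ): a naive split on every space breaks JSON containing spaces into several arguments, while capping the split count keeps it whole.

```python
def parse_add_command(cmd: str):
    # split into at most 5 fields: the final field keeps its spaces intact
    return cmd.split(" ", 4)

parts = parse_add_command('datajson-add actors string deadbeef {"actor": "LSD kitten"}')
```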
codecov bot commented Nov 7, 2024

Codecov Report

Attention: Patch coverage is 3.28685% with 971 lines in your changes missing coverage. Please review.

Project coverage is 67.43%. Comparing base (278dc24) to head (6061d01).

❗ There is a different number of reports uploaded between BASE (278dc24) and HEAD (6061d01).

HEAD has 2 fewer uploads than BASE:

Flag             BASE (278dc24)  HEAD (6061d01)
suricata-verify  1               0
unittests        1               0
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #12102       +/-   ##
===========================================
- Coverage   83.23%   67.43%   -15.81%     
===========================================
  Files         906      846       -60     
  Lines      257647   156207   -101440     
===========================================
- Hits       214458   105332   -109126     
- Misses      43189    50875     +7686     
Flag             Coverage Δ
fuzzcorpus       60.88% <2.98%> (-0.33%) ⬇️
livemode         19.32% <2.88%> (-0.11%) ⬇️
pcap             44.16% <2.68%> (-0.27%) ⬇️
suricata-verify  ?
unittests        ?

Flags with carried forward coverage won't be shown.

@suricata-qa

Information:

ERROR: QA failed on SURI_TLPW2_autofp_suri_time.

field                 baseline  test  %
SURI_TLPR1_stats_chk
  .uptime             636       676   106.29%

Pipeline 23289

@victorjulien (Member) left a comment

Some minor comments.

Bigger issue: I like this work, but I would like to understand why this needs a new set type and keyword type? Can we not overload the existing dataset/datarep facilities?

Resolved review threads: src/datasets-json.h, src/datasets.c (3 threads)
if (set == NULL)
return rrep;

if (data_len != 16 && data_len != 4)
Member commented:

how would we get here with data_len == 4?

@regit (Contributor, Author) replied:

GetDataDst and GetDataSrc in detect-ipaddr.c set a buffer length of 4 or 16 depending on the IP type:

static InspectionBuffer *GetDataDst(DetectEngineThreadCtx *det_ctx,
        const DetectEngineTransforms *transforms, Packet *p, const int list_id)
{
    InspectionBuffer *buffer = InspectionBufferGet(det_ctx, list_id);
    if (buffer->inspect == NULL) {
        if (PacketIsIPv4(p)) {
            /* Suricata stores the IPv4 at the beginning of the field */
            InspectionBufferSetup(det_ctx, list_id, buffer, p->dst.address.address_un_data8, 4);
        } else if (PacketIsIPv6(p)) {
            InspectionBufferSetup(det_ctx, list_id, buffer, p->dst.address.address_un_data8, 16);

so we can get both lengths. Since Suricata stores the IPv4 address at the beginning of the field, this works.
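The length handling under discussion can be sketched like this (Python's ipaddress module used for illustration; this is not the Suricata code): only 4-byte and 16-byte buffers are meaningful, mirroring the `data_len != 16 && data_len != 4` guard.

```python
import ipaddress

def buffer_to_ip(data: bytes):
    # mirror the guard: only 4-byte (IPv4) and 16-byte (IPv6) buffers are valid
    if len(data) == 4:
        return ipaddress.IPv4Address(data)
    if len(data) == 16:
        return ipaddress.IPv6Address(data)
    return None
```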

src/detect.h (resolved)
@suricata-qa

Information:

ERROR: QA failed on SURI_TLPW2_autofp_suri_time.

field                        baseline  test  %
SURI_TLPW2_autofp_stats_chk
  .uptime                    139       144   103.6%
SURI_TLPR1_stats_chk
  .uptime                    636       670   105.35%

Pipeline 23316

@catenacyber (Contributor) commented:

Did you forget to use OISF/suricata-verify#2123 here?

@@ -212,6 +212,10 @@
"xff": {
"type": "string"
},
"extra": {
"type": "object",
Member commented:

I'm trying to get a description on everything new for documentation purposes, can you add one?

@jasonish (Member) commented Nov 19, 2024

I'm a little hung up on the "extra" object? Is this data much different than metadata? For example, if we were to spin a rule for each entry in dataset, we'd add extra context in the message, or metadata, here we're just adding extra metadata from the dataset.

Also, with "extra" being a very generic variable name, should it be available to possible future items that might want to add extra metadata? For example, a Lua rule that wants to add extra metadata? Or should this be restricted to datajson and such be given a more specific name?

@jasonish (Member):

Is there a real need for a new dataset type of datajson? Could there just be dataset, and then dataset that has metadata?

@regit (Contributor, Author) commented Nov 23, 2024

> I'm a little hung up on the "extra" object? Is this data much different than metadata? For example, if we were to spin a rule for each entry in dataset, we'd add extra context in the message, or metadata, here we're just adding extra metadata from the dataset.

I will admit I was kinda expecting to have a discussion on the used keyword. I'm not super happy with the choice but I did not find a better one.

I think we need to avoid conflict, people parsing metadata of signatures are kind of expecting to just have that in metadata. So it should be on its own.

> Also, with "extra" being a very generic variable name, should it be available to possible future items that might want to add extra metadata? For example, a Lua rule that wants to add extra metadata? Or should this be restricted to datajson and such be given a more specific name?

The concept of extracting information at detection time seems pretty good. Flowbits is currently used for that (lua, pcre extraction) but it is then propagated to all events in the flowbits object. It may be better to offer a per alert way of doing this.

Maybe we can stay with extra (or something generic) in this case and extend the capabilities to other systems.

@regit (Contributor, Author) commented Nov 23, 2024

> Is there a real need for a new dataset type of datajson? Could there just be dataset, and then dataset that has metadata?

I think it is more on the educational side, and potentially also a bit about keeping the code simple. On the educational/usage side, it allows signature readers to understand really fast what is actually done. The datajson keyword is rather explicit and different from dataset, so users can understand that there will be consequences.

@jasonish (Member):

> Is there a real need for a new dataset type of datajson? Could there just be dataset, and then dataset that has metadata?

> I think it is more on the educational side, and potentially also a bit about keeping the code simple. On the educational/usage side, it allows signature readers to understand really fast what is actually done. The datajson keyword is rather explicit and different from dataset, so users can understand that there will be consequences.

Can you explain how it's different from a dataset that contains additional metadata to attach to the alert? It's almost like datadata would be the correct keyword then. JSON is just the encoding of the data :)

@regit (Contributor, Author) commented Nov 25, 2024

> Is there a real need for a new dataset type of datajson? Could there just be dataset, and then dataset that has metadata?

> I think it is more on the educational side, and potentially also a bit about keeping the code simple. On the educational/usage side, it allows signature readers to understand really fast what is actually done. The datajson keyword is rather explicit and different from dataset, so users can understand that there will be consequences.

> Can you explain how it's different from a dataset that contains additional metadata to attach to the alert? It's almost like datadata would be the correct keyword then. JSON is just the encoding of the data :)

Currently, what the code is doing is:

  • parse value,json by splitting at the first ,
  • check that json is valid data
    • if yes, add an entry in the hash with value -> string(json) (no transformation here)
  • when an alert fires, add the string(json) directly under the extra subobject without using a JSON dump

So it is really a datajson, as we don't decode and re-encode at output.
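The steps above can be sketched as follows (Python for illustration; load_set and render_alert are hypothetical names, not Suricata functions): the JSON is validated once at load time, stored as its original string, and spliced verbatim into the output, with no decode/re-encode round-trip.

```python
import json

def load_set(lines):
    table = {}
    for line in lines:
        value, _, raw_json = line.partition(",")  # split at the first comma only
        json.loads(raw_json)                      # validity check; no transformation
        table[value] = raw_json                   # keep the original string
    return table

def render_alert(table, value, key):
    # splice the stored string verbatim under "extra"; no json.dumps round-trip
    return '{"alert": {"extra": {"%s": %s}}}' % (key, table[value])
```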
