20.7 New Column Additions to SAT, and the HASHDIFF
The question is what to do with new descriptive columns arriving in the data set. When adding new columns to a Data Vault model, the recommended best practice is to create a new satellite to handle them rather than altering the existing satellite and recalculating all of the hashdiffs.
In the case of columns being deprecated from the data set, and perhaps replaced by new ones, the load process will insert a NULL value into the deprecated column and should continue operating without modification or refactoring. The new columns would spawn a new satellite, with a completely different hashdiff value based on a completely different load process.
None of the old code that loads the original satellite should need to be touched. Otherwise, you will have some refactoring to do, perhaps to the source-to-stage or secondary stage process and/or the target object, because if you do not refactor, you will end up with duplicate descriptive records in the satellite as a result of the new columns. Let’s say that the existing satellite, version 1, contains four columns as follows:
Col1 = A
Col2 = B
Col3 = C
Col4 = D
HashDiff of satellite version 1: Col1 + Col2 + Col3 + Col4 = ‘A’ + ‘B’ + ‘C’ + ‘D’
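To make the compute concrete, here is a minimal Python sketch of the version 1 hashdiff. The ‘||’ delimiter, the MD5 algorithm, and the NULL-to-empty-string cast are assumptions drawn from common Data Vault 2.0 hashing conventions, not a prescribed implementation:

    import hashlib

    def hashdiff(*columns):
        # Assumed convention: cast NULLs to empty strings and join with an
        # explicit delimiter, so every column holds its position in the
        # hash input whether it is populated or not.
        hash_input = "||".join("" if c is None else str(c) for c in columns)
        return hashlib.md5(hash_input.encode("utf-8")).hexdigest()

    # Satellite version 1: four descriptive columns.
    v1 = hashdiff("A", "B", "C", "D")   # hash input: 'A||B||C||D'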
Now, let’s alter satellite version 1 and add three new columns, producing satellite version 2:
Col1 = A
Col2 = B
Col3 = C
Col4 = D
Col5 = NULL
Col6 = NULL
Col7 = NULL
HashDiff of satellite version 2: Col1 + Col2 + Col3 + Col4 + Col5 + Col6 + Col7 =
‘A’ + ‘B’ + ‘C’ + ‘D’ + ‘empty string’ + ‘empty string’ + ‘empty string’ (the NULLs are cast to empty strings in the hash input)
Are the hash diffs the same between version 1 and version 2?
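They are not, provided the hash input keeps a delimiter between columns (without the delimiter, the trailing empty strings would collapse and the two inputs would be byte-identical). Continuing the sketch above:

    # Satellite version 2: the same row plus three new NULL columns.
    v2 = hashdiff("A", "B", "C", "D", None, None, None)   # 'A||B||C||D||||||'

    print(v1 == v2)   # False: same descriptive data, different hashdiff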
Even though the values in the previous rows of the satellite remain the same, those rows have been altered to include the new columns, which will most likely contain NULL values. When those rows are processed again at some moment in the future, they will produce a completely different hashdiff and, as a result, will be inserted into the satellite as if they were deltas.¹
Selecting against this satellite on the same primary key and pulling back history over time will return what appear to be duplicate descriptive rows, because you will have two rows that differ only in the NULLs of the additional columns. This will introduce mistrust in the data among your consumers, and the last thing you want is consumers questioning the trustworthiness of your data.
¹ Load patterns: depending on the nature and frequency of the source data, different load patterns can be applied to a Data Vault. A full load is the simplest pattern: the entire source data set is loaded into the Data Vault every time. It is suitable for small or static data sets that do not change often. A delta load is the most common pattern: only the new or changed data is loaded. It is suitable for large or dynamic data sets that change frequently, and it can be further divided into insert-only, update-only, and insert-update modes.
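To see how those spurious deltas arise in an insert-only delta load, here is a hypothetical loader that inserts a staged row only when its hashdiff differs from the latest one stored for the same key. It reuses the hashdiff function from the sketch above; the names stage_rows and latest_hashdiffs are illustrative only:

    # latest_hashdiffs: hub hash key -> hashdiff of the newest satellite row.
    def delta_load(stage_rows, latest_hashdiffs):
        deltas = []
        for key, columns in stage_rows:
            hd = hashdiff(*columns)
            if latest_hashdiffs.get(key) != hd:
                deltas.append((key, columns, hd))   # treated as new or changed
        return deltas

    # The stored hashdiff was computed over four columns; after the ALTER the
    # same unchanged row is re-hashed over seven, so it is falsely re-inserted.
    stored = {"hub-key-1": hashdiff("A", "B", "C", "D")}
    staged = [("hub-key-1", ("A", "B", "C", "D", None, None, None))]
    print(delta_load(staged, stored))   # non-empty: a duplicate satellite row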
The easiest, most accurate, and most efficient means of dealing with these additional columns is to create a new satellite. This also reduces the risk of introducing an error into the existing load process for the existing satellite.
Why would the hashdiffs have to be recomputed if the satellite were simply altered to include the new columns?
The answer is that if you don’t recompute the hashdiffs, you could end up with multiple duplicate rows being inserted into the satellite.
Conclusion.
If you choose to ALTER the table, then you must recompute the hashdiffs. My comment here is that this is not an acceptable practice: it introduces conditional logic into the hash string compute and/or the load process. The columns either exist or they do not, whether they hold data or not; if they exist, they have to be accounted for in a consistent, repeatable fashion regardless of the data values stored in them.
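As a hypothetical illustration of why that conditional logic is dangerous, continuing the same Python sketch: if columns are included in the hash input only when they hold data, two genuinely different rows can collapse to the same input, and the compute is no longer repeatable:

    # Anti-pattern (illustrative only): include a column only when populated.
    def hashdiff_conditional(*columns):
        hash_input = "||".join(str(c) for c in columns if c is not None)
        return hashlib.md5(hash_input.encode("utf-8")).hexdigest()

    # Two different rows yield the same hash input ('A||B'):
    print(hashdiff_conditional("A", None, "B")
          == hashdiff_conditional("A", "B", None))   # True: a missed delta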
If you choose to follow the recommended best practice, then you create a new satellite to absorb the changes to the source system without having to refactor the Raw Vault model. This is optimal for flexibility, scalability, and resilience, not to mention the reduced risk of introducing errors into the existing load process and into any downstream outcomes, processes, views, etc., that may include the satellite that was altered. It also reduces the resources required to implement the change, including the cost and time of regression testing everything that touches that data flow.
Final Recommendations.
This approach preserves the strength of the SAT as an audit record, as it is NEVER modified.