Remove the time stamps and convert this transcript into beautiful notes that I can always refer to before attending the interview.
"This is Vive In this
0:15
video we will explore data
0:17
classification in Informatica Cloud Data
0:19
Governance and catalog CDGC Let's have a
0:23
quick look at what we will be covering
0:25
in this video
0:34
Data classification is the process of identifying and tagging data into relevant categories based on the metadata and/or the content of the columns. Data classification can be designed to work in three ways: using the metadata of the table or column, using the data of the column, or using a combination of both metadata and data.
Classifying data helps organizations manage risk, ensure compliance, and enhance data security. For example, data classification can help identify where personally identifiable information or other sensitive information, such as credit card numbers or customer addresses, is stored within the organization. Once such sensitive data is identified, organizations can take appropriate steps to protect it.
In CDGC, data classification can be categorized into two segments. Rule-based data classifications are predefined rules created using the metadata and/or data of columns or fields; CDGC provides 200+ out-of-the-box rule-based data classifications, and users can also create their own custom ones. CLAIRE-generated data classifications are automated classifications produced by Informatica's AI engine, CLAIRE, and they work only on the metadata of the columns or fields.
Rule-based data classification can be further segregated into two types: data element classification and data entity classification. Data element classifications are rules that apply at the column or field level and are written in Spark SQL. A data element classification rule can be designed to use only the metadata of the column or field, only the data in the column or field, or a combination of both metadata and data.
For example, classify a column as Social Security Number if the column name contains keywords like SSN or SOC followed by SEC, and the column contains more than 80% of its values in a format like a three-digit number, a two-digit number, and a four-digit number separated by hyphens. This example data element classification rule uses both metadata (the name of the column) and data (the content of the column).
Data entity classification applies at the table level, and it depends on data element classification: tables contain columns, so data element classifications are applied at the column level, and from those the data entity classification is derived at the table or file level. A data entity classification can be designed to consider all, any, or some of the selected data element classifications.
For example, if the Full Name, Gender, Date of Birth, Email, or Phone Number data element classifications are identified in one or more columns of a table, then that table can be classified as a Person entity. So in this example, a table is classified as Person when it contains full name, gender, date of birth, email, or phone information.
CLAIRE-generated data classification is data element classification only; it has no data entity classification, which means CLAIRE can generate data classifications only at the column or field level. These classifications are generated automatically at the column level by Informatica's AI engine, CLAIRE, which is based on machine learning algorithms. When you enable this feature on a catalog source, CLAIRE uses predefined rules to analyze the column or field metadata, mainly the name of the column or field, and automatically generates data classifications for the columns or fields of tables or files. CLAIRE-based data classification works only on the metadata of the columns. Users can turn to this feature when they are not aware of a column's metadata or content and therefore cannot write an effective rule-based data element classification; in such cases, CLAIRE-generated data classification is a good fit.
When you create a rule-based data element classification, you have to specify whether the process should consider conformance percentage or weighted conformance percentage. When a rule-based data classification uses the data of the column, it uses the column profiling data available in CDGC.
On the screen is a sample of the column profiling data that CDGC stores. Consider a column called Gender in some table that has three values: Male, Female, and null/empty. CDGC stores the value frequency, meaning Male appears 11 times, which is about 47% of all values, and similarly for Female and for the null or empty values. This data is consumed by the process for both conformance percentage and weighted conformance percentage.
Let's see how conformance percentage is calculated. The process considers only the unique values: Male, Female, and null/blank. Because these are unique values, each has an occurrence of 1. While calculating the conformance percentage, 1 is divided by the total, which is 1 + 1 + 1 = 3, so 1 divided by 3 is 33%; for each value the conformance percentage is 33%. Now, if our data element classification is written to match the Male and Female values, summing the conformance percentages of Male and Female gives 66.66% as the final conformance percentage.
If you want to calculate the weighted conformance percentage, the process considers the value frequency instead of the unique value count: Male appears 11 times, Female appears 9 times, and blank or null values appear 3 times. So it divides 11 by the sum of all three (11 + 9 + 3 = 23), and calculates the weighted conformance percentage for each value in the same way. If we have written the rule to match only the Male and Female keywords, the total overall weighted conformance percentage will be 86.96%.
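A small Spark SQL sketch of both calculations on the Gender example; the inline (value, frequency) rows stand in for the profile CDGC stores, and the column names are placeholders I chose rather than CDGC's internal ones:

-- Conformance %: matched unique values / total unique values
-- Weighted conformance %: matched value frequencies / total value frequency
SELECT ROUND(100.0 * SUM(CASE WHEN value IN ('Male', 'Female') THEN 1 ELSE 0 END) / COUNT(*), 2)
         AS conformance_pct,               -- 2 of 3 unique values match = 66.67
       ROUND(100.0 * SUM(CASE WHEN value IN ('Male', 'Female') THEN frequency ELSE 0 END) / SUM(frequency), 2)
         AS weighted_conformance_pct       -- (11 + 9) / 23 = 86.96
FROM VALUES ('Male', 11), ('Female', 9), (NULL, 3) AS profile(value, frequency);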
Now the question is: in which situations should we use conformance percentage rather than weighted conformance percentage? If your column has a lot of unique values and the percentage of blank or null values should be ignored, use conformance percentage. For example, SSN, name, country, phone number, and credit card columns usually hold many unique values, so for these we can use conformance percentage.
But if your column has very few unique values and the percentage of blank or null values is important to consider, use weighted conformance percentage. A few examples could be gender, a USA city column (or a city column for any specific country, since values repeat again and again), ethnic group, or flag-type columns. These columns generally hold repeated values, so for them we can use weighted conformance percentage when designing a rule-based data element classification.
For the data classification feature to work, there are a few prerequisites. First, metadata extraction is mandatory for any kind of data classification rule to work. Second, profiling with the "keep signature and values" option is mandatory only if you are designing a rule-based data classification rule that uses the column data; if your rule is designed to work only on metadata, then profiling is not required.
Let's take a few examples of rule-based data element classification. Say the requirement is to identify the columns where Social Security Number data resides in the organization and tag them as SSN. For this we have a rule like: the column name should contain any of the keywords SSN, SOC SEC, social security number, or social-security#, OR at least 80% of the data in the column should follow a pattern like a three-digit number, a two-digit number, and a four-digit number (or other variants of this). This rule considers metadata or data: if either the metadata matches or the data matches, the column is flagged as SSN. If we used the AND operator instead, both would have to match.
In the second example, we identify the columns where credit card data resides in the organization and tag them as Credit Card; for this we consider both metadata and data.
Similarly, in the third rule we try to identify USA Individual Tax Identification Numbers and tag them as USA Individual Tax Identification Number (ITIN): the column name should contain any of the keywords listed, and more than 80% of the column data should match a given pattern like a three-digit number, a two-digit number, and a four-digit number separated by hyphens (or other variants of this). So this rule uses both the column name and the pattern of the column data.
Let's take one example of data entity classification, which is a classification applicable at the table or file level. Say we want to identify a table that contains personally identifiable information and tag it as PII. To classify a table as PII, consider a rule like: the table must contain either full name, SSN, or email information. To identify this information in the table's columns, the process looks for rules such as: classify a column as Full Name when the column name contains a keyword like name, full name, or complete name (so for full name we use only metadata); for SSN we can use the same logic as discussed in the previous slide; and for Email, the column name should contain a keyword like mail, or at least 80% of the column data should be in the format specified.
Now, you might be wondering how this data entity classification gets tagged at the table or file level. When you create a data entity classification, you select one or more data element classifications along with an inclusion rule that says whether all, any, or some of those data element classifications must be present to tag a table with the data entity classification. The process first identifies and tags the columns with the data element classifications that are part of the data entity classification. Then, depending on the inclusion rule, if all, any, or some of those data element classifications are tagged to the columns of a table, the table is tagged with the data entity classification.
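A rough Spark SQL illustration of how an inclusion rule of 'any of Full Name, SSN, or Email' can roll column-level tags up to a table-level PII tag; the inferred_tags rows are made-up demo data and the query only sketches the idea, not CDGC's actual implementation:

-- A table is tagged as PII when ANY of its columns carries one of the selected data element classifications
SELECT table_name,
       COUNT(DISTINCT classification) >= 1 AS tagged_as_pii   -- 'any one' rule; an 'all' rule would compare against 3
FROM VALUES ('passenger', 'full_name',    'Full Name'),
            ('passenger', 'contact',      'Email'),
            ('passenger', 'identity_doc', 'SSN'),
            ('pilot',     'full_name',    'Full Name'),
            ('pilot',     'license_no',   NULL)
     AS inferred_tags(table_name, column_name, classification)
WHERE classification IN ('Full Name', 'SSN', 'Email')
GROUP BY table_name;
-- Both passenger and pilot qualify; a table none of whose columns match simply never appears in the result.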
For this demo, I have created an Oracle catalog source with metadata extraction and data profiling enabled, and then we will enable the data classification capability. In this demo we are going to look at rule-based data classification, so I will select data classification rules. In the next video we will look at generated data classification, so for now we will ignore that option.
Now we need to add the data classifications against which we want to analyze our data. As discussed in the previous slide, we want to find all the columns that contain Social Security Number, credit card information, and USA Individual Tax Identification Number, so I will add those data element classifications here. Click on Add Data Classification, and it will list all the data element and data entity classifications available in your org, both out-of-the-box and custom-created ones. If you are not able to see these data classifications here, navigate back to Explore, select Data Classification from the drop-down, and all the data classifications should be visible. If you cannot see even the out-of-the-box data classifications, simply click Import Predefined Content and follow the process to import them.
Let me add the data classifications here: the first one is SSN, then Credit Card, and the third is US Individual Tax Identification Number. All three are out-of-the-box data element classifications. We also want to identify the tables that contain personally identifiable information, so for that I have created a custom data entity classification called PII. Let me add that here, and then we can save the changes.
Let me also show you what this PII data entity classification looks like, since it is the custom one. Here I have selected three data element classifications, Full Name, Email, and SSN, and the inclusion scope is Any with a count of one, which means a table containing any one of these kinds of information will be tagged as PII. If you want to require all of these data element classifications, you select the All option instead; and if you want to require, say, two of them, you keep Any and specify the number two, which means only tables containing at least two of these data element classifications will be tagged as PII. Since we have selected Any and specified one, a table containing any of this information will be tagged as PII. And of course, because these are data element classifications, when a table gets tagged as PII its underlying columns are also tagged with the corresponding data element classifications.
Now we can run this catalog source. Since this is the first time we are executing it, we will run with all three capabilities selected; but if you have already executed metadata extraction and data profiling, you can run just data classification by unchecking the other two options.
The catalog source executed successfully with the data classification capability. Expand Data Classification and you will see that it was a rule-based data classification execution. Click on it and you will see stats such as the total number of data classifications evaluated, the total number of columns evaluated, the unique number of data classifications, the total number of data classification associations inferred, and the total number of data classification inferences deleted. Here we can see that a total of 11 associations were inferred; click on that and it will show the details.
Here you can see details like: Full Name is a column classified with the Full Name data classification, and it belongs to the passenger table; similarly for all the other columns. This entry says that passenger, of type table, is classified as PII: PII is a data entity classification inferred on the passenger table, and the rest are data element classifications.
Let's navigate back to CDGC and see the result there. I have opened the catalog source; let's go to the tables and open the passenger table. Here we can see that the PII data entity classification is inferred on the passenger table. If I hover over it, it shows details like its type (data entity classification), and since it is in the accepted state, the score will always be 100. Go to Contains and it lists all the columns that are part of this table. Here we can see the data element classifications inferred on these columns: on the full name column the Full Name data element classification, on the contact column the Email classification, on the identity document column the SSN classification, and on the CCN column there are two classifications, SSN and one more. We can simply click on this column, and it shows that CCN is inferred with two data element classifications: one is Credit Card and the other is SSN. I can curate these data classifications from here; say I want to reject the SSN data element classification on the CCN column, I can click on it and it will be rejected.
Why was the PII data classification inferred on this table? Because as per the PII data classification rule, the table should contain at least any of this information: email, SSN, or full name. If I go back to the passenger table and look at its columns, it contains email, SSN, and full name, so since this information is found, the table is classified as PII. Similarly, let's go to the other table, the pilot table; here too the PII data entity classification is inferred. Why? Checking the columns, I can see that Full Name and SSN are inferred here, and that is why this table also carries the PII classification.
We can also use search queries to find the assets that have data classification associations. For example, with a search query for data elements in this catalog source that have data classification associations and a curation status of either accepted or rejected, I can see the total number of such columns; here it shows seven columns with a data classification association. Similarly, I can use different variants of the search query to get the required result.
Let's have a quick look at some frequently asked questions. What metadata can be used for defining data element classification rules? When you define a data element classification rule, you can utilize four things: the name of the table, the name of the column, the comment on the table, and the comment on the column.
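As a toy illustration of combining name and comment metadata in a Spark SQL condition (the identifiers are placeholders I picked, not CDGC's actual rule variables):

-- Match when either the column name or the column comment contains an email-related keyword
SELECT column_name,
       (lower(column_name) RLIKE 'mail' OR lower(column_comment) RLIKE 'mail') AS looks_like_email
FROM VALUES ('contact', 'primary e-mail address'), ('phone', 'contact number')
     AS cols(column_name, column_comment);
-- contact matches through its comment; phone matches neither.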
What profiling data can be used for defining data element classification rules? When the data classification rule is written to use profiling data, you can utilize a number of attributes from CDGC profiling, such as the count of values profiled for a column, the distinct value count, the inferred data type, the average value, the standard deviation, the minimum and maximum values, the null percentage, the most frequent values, the distinct values, the value frequency (how many times a particular value appears in the column), and the frequency percentage. Usually, when we define a rule-based data element classification using profiling data, we use the value frequency data: the values, their frequencies, or their frequency percentages.
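For intuition, the kinds of statistics listed above can be computed with ordinary Spark SQL aggregates; this is only an illustration of what the profiling attributes mean, not of how CDGC computes them internally:

-- Column-level stats for a Gender-like sample: value count, distinct count, null percentage
SELECT COUNT(gender)                                                                 AS profiled_value_count,
       COUNT(DISTINCT gender)                                                        AS distinct_value_count,
       ROUND(100.0 * SUM(CASE WHEN gender IS NULL THEN 1 ELSE 0 END) / COUNT(*), 2)  AS null_pct
FROM VALUES ('Male'), ('Male'), ('Female'), (NULL) AS t(gender);

-- Per-value frequency and frequency percentage
SELECT gender,
       COUNT(*)                                           AS value_frequency,
       ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 2) AS frequency_pct
FROM VALUES ('Male'), ('Male'), ('Female'), (NULL) AS t(gender)
GROUP BY gender;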
Some common search queries related to data classification are listed on the screen. Which language is used for defining rule-based data classifications? CDGC uses Spark SQL syntax for defining classification rules; I have provided a link to the Spark SQL documentation for your reference.
Between rule-based data classification and CLAIRE-generated classification, which one is recommended? Rule-based data classification always provides better confidence, because when designing a rule-based classification the user needs to be aware of the metadata or data of the column against which the rule will run, so the user knows what they are trying to evaluate. This also makes curation of the associations easier, both technically, because it can be done in bulk through the bulk import process, and logically, because the user is aware of the context of the column or table and of the data classification rule against which it is being evaluated.
When should rule-based data classifications be used compared to CLAIRE-generated data classifications? Rule-based data classifications are suitable when the user has a clear understanding of the technical asset's context, that is, its metadata and/or data; in such cases the user can easily define rules to classify the data. But consider cases where the user has no information about what a particular column relates to or what kind of data it contains: defining a rule-based classification then becomes challenging, and CLAIRE-generated classification can be used instead. CLAIRE is Informatica's machine-learning-based AI engine; it uses internal algorithms to analyze column names and generate placeholders for data entity classifications. These placeholders are listed under the generated classifications section of the asset in CDGC. The user then has the option to either promote or reject these CLAIRE-generated placeholders. If the user promotes a placeholder, they can link it to an existing data entity classification, or create a new data entity classification and associate the placeholder with it.
The content discussed in this video is available on my website; you can find the link in the video description below. That's all for today. See you in the next video. Until then, take care."