ETHICAL AND LEGAL ISSUES IN
DATA SCIENCE
(FOR PRIVATE CIRCULATION ONLY)
2021
ABOUT THE AUTHOR
Mr. Kedar Kanhere is a Senior Data Scientist with more than 7 years of
experience in the field of Data Science. He has worked in various domains like
Retail, Supply Chain, Finance, and Technology. After graduating from Pune
University, Kedar Kanhere got an opportunity to work on forecasting techniques,
analytics, and statistical tool development. As a Data Scientist, Kedar
has delivered analytics and forecasting projects for companies like Walmart, Apple
and Microsoft, in which he extensively used tools like R, SAS, Python, Tableau,
and MicroStrategy. He has handled analytical projects on the development side of
software products. Mr. Kanhere also holds a certification in an Advanced Data
Science course from IIM Ahmedabad.
CONTENTS

Unit 1: What are Ethics?
1.1 What are Ethics?
1.2 Sources of Ethical Standards
1.3 Importance of Ethics in Data Science
1.4 Ethical Concerns of Big Data
1.5 Key Ethics Principles for Data Science
Summary; Keywords; Self Assessment Questions; Answers to Check your Progress; References

Unit 2: Some Ethical Concerns of Data Science
2.1 Introduction
2.2 Challenges and Categorizing Them
2.3 The Five Cs
2.4 Implementing the Five Cs
Summary; Keywords; Self Assessment Questions; Answers to Check your Progress; References

Unit 3: History, Concept of Informed Consent
3.1 Introduction
3.2 Informed Consent in Data Science
3.3 Limitations and Challenges
3.4 Three Key Lessons
3.5 Conclusion
Summary; Keywords; Self Assessment Questions; Answers to Check your Progress; References

Unit 4: Data Ownership
4.1 Data Ownership
4.2 Its Meaning and Challenges
4.3 Importance of Data Ownership and Ownership Policies
4.4 Limits and Issues in Data Ownership
4.5 Levels of Ownership
4.6 Data Hoarding and Destruction
Summary; Keywords; Self Assessment Questions; Answers to Check your Progress; References

Unit 5: Privacy, Anonymity and Data Validity
5.1 Privacy
5.2 Anonymity
5.3 Data Validity
Summary; Keywords; Self Assessment Questions; Answers to Check your Progress; References

Unit 6: Algorithmic Fairness
6.1 Introduction
6.2 Correlated Attributes, Misleading but Correct Results and P-hacking
6.3 Definitions of Fairness
6.4 Potential Causes of Unfairness
6.5 Other Methods to Increase Fairness
6.6 Final Comments
Summary; Keywords; Self Assessment Questions; Answers to Check your Progress; References

Unit 7: Societal Consequences
7.1 Introduction
7.2 Distributional Unfairness
7.3 Ossification
7.4 Surveillance
7.5 Asymmetry
7.6 Other Impacts
Summary; Keywords; Self Assessment Questions; Answers to Check your Progress; References

Unit 8: Code of Ethics
8.1 Introduction
8.2 What's Next?
8.3 Challenges and Principles
8.4 Code of Ethics
8.5 Areas of Focus for an Ethical Analyst
8.6 Guidelines for Data Analysts
8.7 Conclusion
Summary; Keywords; Self Assessment Questions; Answers to Check your Progress; References
UNIT 1
What Are Ethics?
Structure:
1.1 What are Ethics?
1.2 Sources of Ethical Standards
1.3 Importance of Ethics in Data Science
1.4 Ethical Concerns of Big Data
1.5 Key Ethics Principles for Data science
Summary
Keywords
Self Assessment Questions
Answers to Check your Progress
References
Objective:
After going through this unit, you will be able to:
● understand ethical standards
● build the framework to analyse these concerns
● understand who owns data, and how we evaluate various aspects of
privacy, consent and fairness.
1.1 WHAT ARE ETHICS?
We live in the 21st century, where every aspect of our society is being
revolutionized by the unprecedented recording and analysis of data.
Data Science has been called “the sexiest job of the 21st century.”
This is not a passing buzz. It is a revolution. We will be recording
much more data and analysing these data much more effectively
tomorrow than we do today.
Ethics deals with what is morally good and bad and right and wrong.
Ethics are rules that we all voluntarily follow, because it makes the
world a better place for all of us. They are the foundation of
civilization.
For most people, morals are informed by religious teachings.
Even though people may follow different religions, most religions
agree on many moral principles; for example, that it is wrong to harm
another human being who has done no harm to you. Most religions
promote ethical behavior, but ethics aren’t necessarily religious.
If you're doing things with data, you have a great deal of power, and
with it comes a great deal of responsibility. Data scientists who have
had ethical training will practice data science more ethically.
This is good not only for data science but also for society at large.
Let us consider an example. Ethical principles stop us from cheating
in exams. Now, there are other things that could stop us from cheating
in exams. If there is an invigilator, we may not want to cheat in the
exam because we are afraid of getting caught. But ethics stops us from
cheating even when no invigilator is present, or when one is present
but there is a chance that we won't be caught.
Ethics are fundamental principles of what we think is right. Ethics are
also not laws. Consider an example where you go to a crowded
bus stop. Everyone is patiently waiting in a queue. When the bus
arrives, you break the queue and jump forward. You may not have
broken any law. It's unlikely that you will be punished for breaking
the queue but we would all agree that in doing so, you would have
been unethical, because there's a shared societal principle of following
the queue.
While ethics are not laws, laws are often created to enforce these
shared social values. The problem is that, even if we have shared
values, not everyone will be ethical.
As discussed, we all have a shared value that says it's wrong to cheat.
That doesn't necessarily mean that no one has cheated in exams. It
just means that we, as a society, have agreed that cheating is wrong. To
reinforce this ethic we have laws that say that if you break this rule,
there are punishments, and these punishments are determined by law.
In this sense laws are used to help enforce ethical behavior.
The free market that underlies so much of the modern economy is based
on the idea that when each individual works to maximize their own
benefit, society as a whole does best. But there also are situations
where individual benefit comes at a cost to society, and when this
happens, such situations are addressed by shared societal value
systems, i.e. ethics.
To summarize, a rule is something that limits what we can do and
therefore there’s a cost associated with us following the rule. But you
might still like to have a rule in place if the cost of you following this
rule is less than the benefit to you if others follow the same rule.
Let's consider a simple example. If we had no rules about how we
drive, we would have chaos on our roads. Traffic rules have all of us
agreeing to drive on a particular side of the road. If we didn't have
these rules, there would be chaos and accidents.
Another example: if we all agree, as a social value, that we shouldn't
litter, and most of us are good about throwing our wrappers in the
trash and not on the street, then the streets are clean, and it is easier
for us to follow the rules, because we get the benefits of following the
rules since everybody else is also following the rules.
To sum up, ethics are shared rules that we all agree to follow as they
are beneficial to everyone. These rules are typically infused with a
sense of right and wrong, good and bad.
1.2 SOURCES OF ETHICAL STANDARDS
The five major sources of ethical standards are:
1 Utilitarian approach
2 Rights/Deontological approach
3 Fairness approach
4 Common good approach
5 Virtue approach
1. Utilitarian approach
Imagine that the CBI is tipped off about a plot to set off a dirty bomb
in the capital city. Agents capture a suspect who has information about
where the bomb is planted. Would it be correct for them to torture the
suspect into revealing the bomb's whereabouts in order to save many
others? If 'yes' popped up in your mind, you were probably using
utilitarianism. In simplest terms, utilitarianism is a moral principle
which holds that the morally right action in any situation is the one
that creates the greatest balance of benefits over disadvantages for
everyone affected. As long as the action brings maximum benefit to
everyone, utilitarianism does not care whether that benefit is produced
by lies, manipulation or coercion.
To understand what we must do in any situation, we follow three
steps:
(i) Identify the different courses of action that we can perform.
(ii) We determine all the advantages and disadvantages arising
from each action for each person affected by the action.
(iii) We choose the course of action that offers the greatest benefits
after all the disadvantages have been taken into account.
Perhaps the biggest drawback of utilitarianism is that it fails to
consider justice.
2. Rights approach
Respect for human dignity is the highlight of the rights approach. This
approach is based on our ability to choose freely how we live our
lives: we have a moral right to have our choices respected as
independent, equal and rational people, and a moral duty to respect
others in the same way. Other rights include the right to privacy, to
truthful information about matters affecting our choices, and to
protection from harm and injury.
This approach allows us to identify our own and others' legal rights, as
well as our duties and responsibilities, in a given situation. When
conflicting or competing interests or rights are confronted, we need to
decide which interest matters most and prioritize the right that best
protects or ensures that interest. For example, the right to freedom of
speech is generally protected, but citizens do not have the right to
shout "bomb" unnecessarily in crowded public places or to participate
in hate crimes.
3. Fairness/Justice approach:
Across the spectrum of society, the Fairness Approach focuses on the
fair and equitable distribution of good and harm, and of social
benefits and social costs. It begins with the presupposition that
everyone should be treated equally, and that those who are unequal due
to relevant differences should be treated differently, in a way that is
fair and proportionate to their difference. A common example is paying
employees at different salary levels based on their contribution to the
corporation's profit from their work efforts.
Here we evaluate our actions in terms of how they treat the people
affected. Are they treated the same as others in similar situations?
Are those who differ in relevant ways treated differently in terms of
legitimate distinctions and merit? Is there a case where some get
benefits for no good reason? What are the relevant factors that
determine similarities and differences within a group?
4. Common good approach:
The Common Good Approach treats all people as part of a larger
community. Similarly, we share some common situations and
organizations on which our well-being depends. For the society to
prosper, it is necessary to protect the sustainability of our community
for our benefit, including weak and vulnerable members. Some of the
things that nurture a healthy, functional community are: stable family
life; Good school; Affordable nutrition and health care; Effective
7
public safety; Just a legal system; Fair trade and commerce; A secure,
well-managed ecosystem; An accessible technology environment;
Well-maintained infrastructure; And a peaceful society.
While the utilitarian principle weighs the net balance of goodness and
harm or wrong and lesser wrong produced by a particular action on an
individual group, this approach examines whether an action benefits
from a particular component of the common good.
5. Virtue approach:
The Virtue Approach is focused on individual character and
dispositions which strengthen our humanity and enhance our
relationships with others. Honesty, kindness, restraint, civility,
compassion, diligence, self-reliance, loyalty, generosity, patience,
endurance, conscientiousness, self-control and discretion are valued
by almost all cultures.
This approach leads us to question whether a given action is a
reflection of the type of person we are or want to be. Will it encourage
the kind of character we value in ourselves and for our society? Does
it represent what my business aspires to be? If things don't go as
planned, is it possible for us to "live" with it?
1.3 IMPORTANCE OF ETHICS IN DATA SCIENCE
New technologies often present us with new moral questions. The rise
of nuclear weapons, for example, put great pressure on the distinction
between combatants and non-combatants, which was central to the just
war theory developed in the Middle Ages.
With the emergence of new techniques of machine learning and the
possibility of using algorithms to carry out tasks previously performed
by humans, as well as the possibility of generating new knowledge,
we again have to face a set of new ethical questions. These questions
are not only about the possibility of harm due to misuse of data but
also about how to protect privacy when data is sensitive, how to avoid
bias in data selection, how to avoid intrusions and data "hacking",
and how to maintain transparency in data collection, research and
dissemination. The biggest of these is the larger question of who owns
the data, and who has the right of access to it, and under which
conditions.
At the moment, these questions remain unanswered; however, it is
very important to face them and try to work on shared ethical
guidelines. When agreement is not possible, it is important to
weigh competing values and explicitly clarify the underlying
assumptions employed in the different models. The way in which
models are built by scientists has an impact on justice, health and
opportunities in people's lives. And it is our duty to reflect on the
righteousness of our discipline every day.
If built correctly, algorithms have massive power to do good in the
world. Cost savings, scalability, accuracy, speed, and consistency are
the advantages when algorithms are used for tasks which were
previously done by humans. Generally, results tend to be more fair
and less subject to social bias when using a system that is more
accurate and consistent than a human being.
The Data-Driven Justice Initiative seeks to prevent incarceration by
using data to help people with mental illness, substance abuse and
health problems access the resources they need and stay out of
prisons. These solutions and similar other initiatives not only save
money but also save lives.
However, if data science is not handled ethically, sensitive
information can be used incorrectly and cause harm unintentionally.
Private and sensitive information such as photographs, passwords,
location information, etc. can end up in the wrong hands.
Predictive models used for policing and sentencing hearings can
reinforce stereotypes and have adverse racial or socio-economic
implications. Opportunities can be denied in the form of school
admissions, recruitment and loan approvals. Healthcare decisions
could be made incorrectly, compromising a person's health and
sometimes resulting in their death. And of course, when data
scientists use the power of data to sow mistrust and discord, our
democratic system itself can crumble.
If you are not on your guard, it is easy even for a data scientist with
good intentions to make unethical decisions. While viewing data, we
tend to forget that it's only as accurate and objective as the people
and processes used to collect and compile it in the first place. Since
modern machine learning tools are complex, they are often difficult
for humans to interpret and understand. This makes it difficult to
determine appropriate inputs and the ethical implications of results.
It's almost as if the answer is coming from a Data Deity, which
most of us do not understand but in which we place our faith and trust.
Most data scientists are trained in disciplines like computer science,
applied mathematics, or statistics. In these fields, data science is
often used for academic theory and research, rather than to provide
information about real-world behavior that affects people's lives.
1.4 ETHICAL CONCERNS OF BIG DATA
We have unparalleled access to data today and unparalleled options
for analysis of this data. And so there's virtually no limit to what data
science could do. The question here is, should we do everything that's
possible? Are there things that are possible to do that we agree we
should not do?
Data science gives us the ability to do things in various aspects of
society and create a huge impact. It has social consequences both good
and bad. This impact also includes undesired consequences regarding
privacy, fairness, equity and many more. Ethics guides us on how we
decide what is okay to try and do and what isn’t.
Data scientists do work that has tremendous potential for delivering
great values. It can help the organization, the people and society in
many different ways. For reaping the benefits of data science, we need
to develop a shared sense of ethical values while minimizing the harm
that it could possibly do.
For example, unsolicited email was considered a great idea when the
internet was new. Over time spam became a big problem, and today no
law-abiding business will own up to spamming intentionally, because
it is now considered socially unacceptable.
Organizations have been using data to gain useful insights for a long
time now. But with the big data revolution, this has increased
tremendously. Companies hire data professionals to use this data to
their benefit. These professionals are given authority to read data and
produce insights, and hence they handle a lot of sensitive data from
ordinary people as well as large organisations. That is why data
scientists must adhere to a code of conduct in their day-to-day work
processes.
Given below are some of the ethical considerations concerning the
relationship between a client and a data scientist:
● Decision Making
Data scientists must not make decisions without consulting the client
under any circumstances. Even if the decision that the data
professional has in mind is for the betterment of the project, they
should make the decision only when they are fully empowered to do so
as per the agreement or their authority.
● Communication With the Client
Transparency has to be maintained between a data scientist and the
client, at all times. The clients should always be well informed about
the different ways the project is being handled, i.e., a data scientist
must keep the client in the loop about things like what data is being used,
where it is being used, and how it is being used. The client should also
be informed of the progress made and consulted about any real or
potential hidden risks depending on the outcome of the data science.
The client should be able to make well informed decisions regarding
data science and hence the result must be explained clearly and
thoroughly to them.
● Confidentiality
Data scientists are always involved in creating, developing and
receiving information. Therefore it becomes a data scientist’s duty to
protect the confidential information regardless of the type. This type
of information should be discussed or talked about only when the
client allows the data scientist to do so.
● In The Event of Conflict of Interest
A conflict of interest is bound to occur when a data scientist's service
for one client is directly adverse to another client, or when there is a
significant risk that the data scientist's responsibilities to a client
will be limited by work done for another client or for other parties
involved.
● Dealing With Potential Clients
A prospective client is someone who is in constant contact with the
data scientist but is not a client yet. During the discussion this
potential client shares valuable information with data scientists who
need to be responsible when working with this data.
● Always Being Informative
If there is a new trend or development in the field relevant to a
project that the data professional is working on, it is his/her
responsibility to explore different ways to enhance the project.
1.5 KEY ETHICS PRINCIPLES FOR DATA SCIENCE
1. Collect minimal data and aggregate it.
Companies intending to protect their users and their data need to
make sure to collect only the necessary data. A lot of data does not
guarantee that there is a lot of usable data. It is important to keep data
collection concise and deliberate. In order to protect privacy, the
relevant data must be held in high regard.
Aggregating data is also important for protecting privacy and
maintaining transparency. From machine learning and autonomous cars to
data science and predictive analytics, algorithms are being used in
almost everything. The algorithms used on collected data allow
companies to observe very specific behavior patterns among customers
while preserving their identities.
Hui Xiong, an associate professor of management science and
information systems, states:
“One way companies can harness this power while heeding privacy
worries is to aggregate their data. If the data shows 50 people
following a particular shopping pattern, stop there and act on that data
rather than mining further and potentially exposing individual
behavior.”
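The idea of acting on aggregates rather than on individual records can be sketched in a few lines of Python. This is only an illustration: the pandas DataFrame, its column names and the threshold of 50 are assumptions made up for this example, not part of any real system.

import pandas as pd

# Hypothetical purchase records; in practice these would come from a database.
purchases = pd.DataFrame({
    "customer_id": [101, 102, 103, 101, 104, 105],
    "store":       ["S1", "S1", "S1", "S2", "S1", "S1"],
    "category":    ["baby", "baby", "baby", "grocery", "baby", "baby"],
})

# Aggregate to the pattern level: how many distinct customers share a
# (store, category) shopping pattern.
pattern_counts = (
    purchases.groupby(["store", "category"])["customer_id"]
             .nunique()
             .reset_index(name="n_customers")
)

# Act only on patterns supported by enough people (50 in Xiong's example);
# below this threshold we stop rather than drill down to individuals.
MIN_GROUP_SIZE = 50
actionable = pattern_counts[pattern_counts["n_customers"] >= MIN_GROUP_SIZE]
print(actionable)

Acting on the aggregated table rather than on the raw purchase log is what lets the company observe the pattern without exposing any one customer's behavior.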
Google, Apple, Facebook, Microsoft, and Amazon collect the most
private and sensitive information and hence have the most
responsibility. Apple typically has the strongest parameters for
analyzing and safeguarding the data collected from users, as they
understand data. An article states, "Apple even opposed the FBI over
a case it said would set a 'dangerous precedent' for user privacy."
2. Scrub the sensitive data after identifying it.
Employees in the field of information science need to understand
which data is sensitive and personal, and identify the ways in which
this information may be utilized.
If consumer information is collected without consent, it must be
scrubbed of anything that could make the figures personally
identifiable.
An article titled Five Ways to ‘Exploit’ Big Data Without
Compromising Privacy highlights the following:
Violating regulations can lead to penalties, reputational
consequences and loss of customers. There are ways to reduce risk
while taking advantage of the opportunities offered by data science.
Organizations need to implement data privacy solutions that prevent
infringement and enforce security:
● Identifying sensitive data.
● Ensuring the identified sensitive data is secured.
● Providing proof of compliance with all applicable laws and
regulations.
● Proactively monitoring the data and IT environment.
● Responding faster to privacy or data breaches with incident
management.
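As a minimal illustration of the first two points above, the following Python sketch removes direct identifiers and pseudonymizes a quasi-identifier before the data is analysed or shared. The column names and the salted-hash approach are assumptions made for this example; real anonymization needs far more care, and hashing alone does not guarantee anonymity.

import hashlib
import pandas as pd

records = pd.DataFrame({
    "name":    ["A. Kumar", "B. Shah"],
    "email":   ["a@example.com", "b@example.com"],
    "pincode": ["411001", "411002"],
    "amount":  [250.0, 410.0],
})

SALT = "replace-with-a-secret-value"  # kept separate from the released data

def pseudonymize(value: str) -> str:
    # One-way hash so records can still be linked but not read back directly.
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:12]

# Drop direct identifiers entirely and pseudonymize what must be retained.
scrubbed = records.drop(columns=["name", "email"]).assign(
    pincode=lambda df: df["pincode"].map(pseudonymize)
)
print(scrubbed)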
3. Plan beforehand in case your insights backfire.
Whether you are aware of it or not, every time you go to the store and
purchase something, some form of information is collected about your
trip to the store. It can be your phone number, email address or some
other information. A typical example would be an online shopping
store analyzing your data and sending you relevant offers and
coupons.
An offline example: a retail giant like Big Bazaar or DMart could
develop a method based on 25 items that, when bought together, usually
indicate that a customer is buying a new house. This type of customer
awareness is great for understanding the habits of shoppers and for
deciding which promotions and coupons to send out. But the method
does not always work out. The process can backfire, and in that case
the giant retailer's analytics paint a picture of big data that demands
immense attention to detail.
Data influences pretty much everything. Entire industries, such as
healthcare, are centered around collecting anonymized information.
The criminal justice industry and agencies like the CBI and
RAW consider big data to be one of the top technologies of the trade.
Professional coaches and even athletes are also paying attention,
using data-driven wearable technology to optimize performance
and reduce injuries.
Although the organizations behind data science have a duty to keep to
a set or code of ethics, we must be careful to keep our information
safe and secure.
Check your Progress
Fill in the blanks:
1 Ethics deals with what is morally ____ ___ ____ and _____ ___
_____.
2 The ________________ approach focuses on choice between
the worst and bad outcome.
3 The way in which models are built by scientists has an impact
on _________, __________ and _____________ in people's
lives.
4 Data scientists do work that has tremendous potential for
_________________ _______________ ______________.
5 Organizations need to implement data privacy solutions that
prevent ___________________ and enforce _________.
MCQ (one correct answer)
1 You know this kid from a very poor family and how hard his
family struggled to be able to send him to this class. You see
him really struggling with the exam, and you know that he
cannot afford to fail this class. Driven completely by kindness,
you let him copy answers from your exam book. Is your action
ethical?
a. Yes
b. No
2 You go to the beach and there are prominent signs asking you
not to litter and informing you that there can be heavy fines if
you are caught littering. However, you find that someone has
left their empty soda can, so you decide to leave your empty
soda can on the beach rather than bother to carry it back home.
Based on social consensus, is your action:
a. Legal and ethical
b. Legal but unethical
c. Illegal but ethical
d. Illegal and unethical
3 You run a small business, and keep all business records on an
unprotected personal computer. These records include
substantial information about customers. Since you are a small
business, you believe you are not a likely target for hackers.
Indeed, several years have gone by and no one has stolen any
information from your unprotected computer. Are your actions
ethical?
a. Yes
b. No
4 You go to the bus stop and everyone is patiently in line waiting
for the bus. Rather than wait in line, you just jump onto the bus
when it arrives. Your action is:
a. Legal and ethical
b. Legal but unethical
c. Illegal but ethical
d. Illegal and unethical
MCQ (multiple correct answers)
1 The rights approach allows us to identify
a. Legal rights
b. Duties
c. Responsibilities
d. None of the above
2 Which of the following statements about ethics is incorrect?
a. Ethics are neither Laws nor religion
b. Ethics are only laws
c. Ethics are only religion
d. Ethics are both Laws and religion
True or False
1 In dire situations, such as a patient being severely ill, it is
ethical for the doctor not to be completely transparent with the
patient.
2 Utilitarian principle weighs the net balance of goodness and
harm or wrong and lesser wrong produced by a particular action
on an individual group.
3 Apple has the weakest parameter for analyzing and
safeguarding the data collected from users.
Activity
Give an example of a situation where we are all better off as a society
because we agree to behave ethically. (Example: like we discussed,
not stealing)
Summary:
➔ Ethics:
● Ethics guide us on how to behave.
● Ethics are advocated by religion, but they are not religion.
● Ethics are not laws; laws are implemented to ensure ethics are
followed.
➔ Ethical standards:
● Utilitarian Approach
It is based on which benefits and harms each course of action
will produce, and which alternative will lead to the best overall
consequences. It answers the dilemma of which option will
produce the greatest benefits and the least harm.
● Rights Approach
The rights approach is based on the belief that every individual
has the ability to make their decisions freely. It holds that if an
action does not respect everyone's moral rights, it is the wrong
way to act.
● Fairness/Justice Approach
It is based on a course of action that treats everyone the same,
except where there is a morally justifiable reason not to, and at
the same time does not show favoritism or discrimination. This
approach gives the individual the opportunity to reflect on whether
their action is fair and just to others.
● Common Good Approach
This approach helps drive our choice to decide if the action
taken will be good for ourselves as well as the community. It
also answers questions related to the type of society we want to
live in and how to achieve it.
● Virtue Approach
We strive to maintain and hold onto our internal values and
morals. This approach reflects the kind of person you are and
how you should be.
➔ Importance of ethics in data science
● With the emergence of new techniques of machine learning and
the possibility of using algorithms to carry out tasks and
generating new knowledge, we again have to face a set of new
ethical questions. These questions are:
○ possibility of harm due to misuse of data,
○ how to protect privacy when data is sensitive,
○ how to avoid bias in data selection,
○ how to avoid intrusions and data "hacking",
○ how to maintain transparency in data collection, research
and dissemination,
○ who owns the data, and
○ who has the right of access to it, and under which conditions.
● It is very important to face these questions and work on shared
ethical guidelines. When agreement is not possible, it is
important to weigh competing values and explicitly clarify
the underlying assumptions employed in the different
models.
● The way in which models are built by scientists has an impact
on justice, health and opportunities in people's lives. If errors
are made, opportunities can be denied in the form of
school admissions, recruitment and loan approvals. Healthcare
decisions could be made incorrectly, compromising a person's
health and sometimes resulting in their death.
● When data scientists use the power of data to sow mistrust and
discord, our democratic system itself can crumble.
● Data is only as accurate and objective as the people and
processes used to collect and compile it in the first place. This
makes it difficult to determine appropriate inputs and the
ethical implications of results. Hence it is important to
make ethical decisions.
➔ Ethical Concerns of Big data
● Due to unparalleled access to data and unparalleled options for
analysis, there is no limit to what data science could do.
● Data science gives us the ability to do things in various aspects
of society and create a huge impact.
● Data scientists must adhere to a code of conduct in their day-to-
day work processes.
● General ethics should be followed while:
○ Decision making
○ Communicating with the client
○ Confidentiality of the client information
○ Working with a potential client
○ Being transparent with the client
➔ Key Ethics principles for data science
● Collection of minimal data
Companies intending to protect their users and their data need
to make sure to collect only the necessary data. Aggregating
data is also important for protecting privacy and maintaining
transparency. The algorithms used on collected data allow
companies to observe very specific behavior patterns among
customers while preserving their identities. Apple typically
has the strongest parameters for analyzing and safeguarding the
data collected from users.
● Scrub the sensitive data after identifying it.
Employees in the field of information science need to
understand which data is sensitive and personal and identify the
ways in which this information may be utilized. Violating
regulations can lead to penalties, reputational consequences and
loss of customers. There are ways to reduce risk while taking
advantage of the opportunities offered by data science.
Organizations need to implement data privacy solutions that
prevent infringement and enforce security.
● Plan beforehand in case your insights backfire
Whether you are aware of it or not, some form of information is
collected every time you make a purchase online or offline. It
can be your phone number, email address or some other
information. Customer awareness is great for understanding
habits of shoppers and for deciding which promotions and
coupons to send out. This process can backfire, and in that case
the giant retailer's analytics paint a picture of big data that
demands immense attention to detail. The criminal justice
industry and agencies like CBI, RAW consider big data to be
one of the top technologies of the trade. Although the
organizations behind data science have a duty to keep to a set or
code of ethics, we must be careful to keep our information safe
and secure.
Keywords
Data Science: The art of solving business problems with
software programming and statistics.
Big Data: Big data is a field that treats ways to analyze,
systematically extract information from, or otherwise deal with
data sets that are too large or complex to be dealt with by
traditional data-processing application software.
Self Assessment Questions
1. What are Ethics?
2. Which are the key Ethical Principles for Data Science?
3. What is the importance of Ethical Principles in Data Science (in
brief)?
Answers to Check your progress
Fill in the blanks
1 Ethics deals with what is morally good and bad and right and
wrong.
2 The Utilitarian approach focuses on choice between the worst
and bad outcome.
3 The way in which models are built by scientists has an impact
on justice, health and opportunities in people's lives.
4 Data scientists do work that has tremendous potential for
delivering great values
5 Organizations need to implement data privacy solutions that
prevent infringement and enforce security.
MCQ (one correct answer)
1 b
2 d
3 b
4 b
MCQ (multiple correct answers)
1 a,b,c
2 b,c,d
True/False
1 True
2 True
3 False
References:
1. https://www.kdnuggets.com/2016/07/ethics-principles-big-data-science.html
UNIT 2
Some Ethical Concerns of Data Science
Structure:
2.1 Introduction
2.2 Challenges and Categorizing them
2.3 The five Cs
2.4 Implementing the Five Cs
Summary
Keywords
Self Assessment Questions
Answers to Check your Progress
References
Objective:
After going through this unit, you will be able to:
● understand various concerns of data science
● understand how to categorize them
● learn how to address them.
2.1 INTRODUCTION
Although data science has no value framework, organizations have
their value systems established. By asking and seeking answers to
various ethical questions, one can ensure it is used in harmony with
the organizations’ ethics and values.
Data science can be broken down into three components:
(1) business intelligence, which is essentially about presenting
company data in front of the right people in various forms like
dashboards, reports, and emails;
(2) decision science, which takes data and helps a company in making
a decision; and
(3) machine learning, which answers questions like "how can we
take data science models and keep them in continuous
production."
Undoubtedly, the future will be completely driven by Machine
Learning, and Data Science forms the epicenter of this future. It
fuels the machines with the data they are trained on. Every
self-driving car, every advertisement, every medical diagnosis provided
by a machine will be based on certain data. For example, the
advertisements you see on different webpages are based on your
search history.
Data ethics is a rapidly evolving field of study. Increasingly, those
collecting, sharing, and working with data are looking into the ethics
of their methods. Failure to handle data ethically can have serious
repercussions on people and can lead to a loss of trust in products,
projects or organizations.
We have a rough idea about how data science works in the tech
industry. First, data scientists lay a solid data foundation in order to
perform robust analytics. They then experiment online with other
methods for sustainable growth. Lastly, machine learning pipelines are
built and data products are personalized in order to understand their
business and customers better and to make better decisions. To put it
into simpler words, data science is about testing, machine learning for
decision making, and data products in the tech industry.
We are seeing rapid developments in the open-source ecosystem of
tools available for data science as well as in the commercial,
productized data-science tools. We are also witnessing increasing
automation of data cleaning, data preparation and other data-science
drudgery. It is widely accepted that most of a data scientist's
valuable time is spent finding, cleaning, and organizing data, and
comparatively little of it in actually performing analysis.
Data scientists make their living through collecting, cleaning and
visualizing data; building reports and dashboards; communicating
results to key stakeholders; statistical inference; and convincing
decision makers of their results.
Data ethics is solely in the hands of data scientists, and that will
not change any time soon. Data scientists, who have access to critical
tools that can influence how people think and that have the potential
to affect their behavior, most of the time do not get a single hour of
ethics training. That is a problem that needs to be fixed.
In the ethical debate, everyone associated with collecting and
controlling data should have a voice about how data should be used.
Organizations must openly address these difficulties in formal and
informal conventions.
2.2 CHALLENGES AND CATEGORIZING THEM
“Ethics is not solely about agreeing with the set of principles, but
more about changing the way you work.”
Ethical challenges arise when opinions change on what is considered
right and wrong. The major ethical challenges related to data and data
science are related to reinforcement of human biases, lack of
transparency, consent and power, privacy and unfair discrimination.
● Reinforcement of Human Biases
This type of problem usually arises in areas such as financial loans,
policing and insurance where various computer models are used in
making predictions. For example, if members of a certain racial group
have a history of defaulting on their loans, or have been more likely to
be convicted of a crime, then the models are likely to consider
and categorize these people as riskier. This does not necessarily
mean that these individuals will actually default on their loans or
engage in more criminal behavior.
● Lack of transparency.
Let’s take an example of peer graded assignments on online learning
websites. You submit your assignment that is meant to be evaluated
by your peers. The grading process returned two grades. One was the
average peer grade and the other was a computed grade that was a
product of some algorithm used to adjust bias. You do know how the
averaged peer grade was calculated but have no idea about how your
computed grade was calculated. In this scenario, a case of low
transparency would be: you were just given your computed grade with
no explanation whatsoever. A case of high transparency would be: you
received the computer-calculated grade along with a paragraph
explaining how the grade was calculated, why adjustments were
made and the type of algorithm used, and you also received your raw
peer grade ratings and saw how each of these ratings was fine-tuned by
the algorithm to arrive at the final grade.
● Consent and power.
Let's take an example of electricity meters that calculate light bills.
Suppose there are two types: The first which has high tariff but data
anonymity is guaranteed and the other which has cheaper tariff but the
data is granular. As a result, the consumer can trade privacy for
electricity costs, potentially forcing low-income households to consent
to greater data sharing.
However, consumers first understand and access the data before
giving consent to sharing. This problem has been coined as the
transparency paradox: low-level data is difficult to understand, and
summary statistics hide important details — we cannot achieve one
without giving up on the other.
● Privacy.
In the above example, at first glance, did you consider electricity
consumption to be sensitive data? Did you know that the granular
meter readings can be used to determine whether a person is at home
or not? In depth analysis of these readings would also answer
questions like which appliances are used at what time, whether you
leave appliances on for longer than required and even features of
buildings you reside in. In this way, sensitive details of a consumer’s
daily life could be exposed and used in ways that invade individual
privacy.
● Unfair Discrimination
If the data reflects unfair social biases about sensitive traits such as
race or gender, the conclusions drawn from these data may also be
based on those biases.
Users of Algorithms
Another way to categorize ethical issues in data science is by people
using these methods.
● Business.
The public receives valuable services in return for the use of
their data, and it is clear that (a) people value these services, and
(b) people are used to the advertising-based model and are unlikely
to pay for these services. In addition, they do have options in
configuring what they actually share. A study found that as few
as 15% of users blocked cookies or JavaScript, while the
rest did not.
● Government.
The "surveillance state" looms as a threat. There are over half a
million surveillance cameras in London, which feature in most British
crime dramas. At a practical level, British citizens may not be
bothered by their presence, but in autocratic states it is a
different matter. For example, comprehensive and detailed
individual data collection enables officials in some non-
democratic societies to have uncontrolled and destructive
control over individuals and communities.
● Personal data
Almost every other week there is an incident of a data leak or an
internet scam. While credit card data theft is a real nuisance for
victims, it does not reach catastrophic or life-changing levels,
thanks to the AI-enabled protections that limit loss. Identity
theft, however, can ruin a person's well-being and completely disrupt
their life.
Categories to Consider
● Behavior change and manipulation
A variety of AI and machine learning techniques help to change
and manipulate individual behavior. People are most aware of
this when they see ads follow them around – for example, that
mobile cover you searched for now keeps popping up on
different websites you open in the browser. Businesses use
predictive modeling to micro-target advertising at low cost.
Social media filtering algorithms maintain useful business and
social connections, and recommendation systems help you
find products of potential interest. For almost every positive and
constructive application of these algorithms, it is also possible to
point to catastrophic and dangerous ones.
● Deception – surface and deep.
The desire to create havoc lies dormant in many people. The
tools to wreak havoc are presented to them by the internet – the
extremely alluring and dangerous tools of video, voice and image
synthesis. The technology has produced video and voice synthesis
tools that mimic real people so closely that the results are
indistinguishable from the real thing. Imagine the financial loss
that could be caused by fabricated instructions in banking
relationships that rely on trusted communications, or by fabricated
statements attributed to political leaders.
The most difficult thing is not understanding ethics but maintaining
the junction between ethical ideas and practice. Data scientists and
software developers do not want to harm the people who use their
products, at least not intentionally; the ones who do it intentionally
are called con artists and criminals. Defining "fairness" is difficult,
and potentially impossible, given the many overlapping layers of
"fairness" we may have to deal with. The problem we want to deal with
is: how do we apply ethical principles in practice? If they do not
affect everyday practice, then ethical principles are worse than
useless. This is a big challenge for data scientists, regardless of
whether they are working on leading-edge AI or are classical data
analysts. We need to work to build software systems that implement
fairness. That is doing good data science.

A universal code will advise against collecting data from
experimental subjects without informed consent. But it does not guide
us in how informed consent should be implemented. It is easy to
understand how informed consent has to be implemented when we are
talking about interviewing a bunch of people for some psychology
experiment. But it means something different when you search for some
product online and then ads related to that item start popping up
everywhere. Imagine how many customers would be lost if agencies
started using a pop-up to ask permission to use their choices for
targeted advertising. Informed consent means something different
when you enter your address pin code on a shopping website and they
might (or might not) use your data for any number of experimental
purposes that you are not aware of but consented to when you clicked
that box full of fine print. Do you pop up a consent notice which, in
simple words, translates to "we will use your data, but we don't know
for what", in fine print, and hide it on the web page where it would be
difficult to find? These are the kinds of questions we need to answer
and find the best solutions for.

Implementing ethical principles involves everything from user
experience design to data management. How do we manage any sensitive
data that we acquire from the user? Data about race, gender,
disabilities, or other protected classes should not be collected.
Sounds logical, right? But if we skip gathering that data, we will have
trouble testing whether the applications we created are fair to
minorities or not. We will discuss this in detail in the next part.
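To make "testing whether an application is fair" a little more concrete, here is a minimal Python sketch of one common check: comparing approval rates across groups. The data, column names and the notion of fairness used (demographic parity) are assumptions made for illustration; a real fairness audit involves many more considerations.

import pandas as pd

# Hypothetical decisions from some model, with the protected attribute
# retained only for auditing purposes.
decisions = pd.DataFrame({
    "group":    ["A", "A", "A", "B", "B", "B", "B"],
    "approved": [1,   1,   0,   1,   0,   0,   0],
})

# Approval (selection) rate per group.
rates = decisions.groupby("group")["approved"].mean()
print(rates)

# Demographic parity difference: the gap between the highest and lowest rates.
# A large gap is a signal to investigate, not a verdict on its own.
gap = rates.max() - rates.min()
print(f"Demographic parity difference: {gap:.2f}")

Note that a check like this is only possible if the protected attribute was retained somewhere for auditing, which is exactly the tension described above.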
2.3 THE FIVE Cs
A good data product or service requires not only a useful product or
service that is commercially viable, but also one that uses user data
ethically and responsibly. We often focus more on a product's
technology or its user experience than on how to build the data product
responsibly, in a way that puts the focus on the user. Such products
are what we actually need. Although Facebook has received the most
attention when it comes to data breaches, lack of trust isn't
confined to a single platform. Lack of trust and data security issues
range from large traditional retailers to data collectors and brokers in
industry, and even to the government. Abuse by malicious ads
and by fake and misleading content are prime reasons for losing
users' trust. Decisions made by a system that was trained on biased
data can lead to an insurance claim being denied or a loan not being
approved, even though the user may be innocent and have a clean
history. The Economist has proclaimed that data is the new oil, and
hence valuable. The public provides the data under the assumption that
the public will benefit from it. We also assume that data will be
collected and stored responsibly, and that data suppliers will not
misuse it. Broken down to its bare minimum, it is a model of trust.
There is no point in pretending to be trustworthy when your actions
have proven otherwise. It takes time to regain broken trust, and the
only way to do it is to be trustworthy. The golden rule for data:
"treat others' data as you would have others treat your own data."
However, applying it in research and development is challenging. The
golden rule isn't self-sufficient; it needs guidelines for actual
implementation. There are five framing guidelines that help us think
while building data products. These are called the five Cs:
1 Clarity
2 Consent
3 Consistency
4 Control (and transparency)
5 Consequences (and harm)
● Clarity
Clarity is entwined with consent. Users need to have a clear idea about
what data they are providing, what will be done with the data, and
what the downstream consequences of using their data will be.
Oftentimes, detailed information about what data is collected or sold,
and how it will be used, is buried in long legal documents that are
rarely read carefully, if they are read at all. Facebook users playing
the "This is your digital life" game by Cambridge Analytica may have
realized that they were providing their data by answering questions
and that these answers certainly were stored somewhere. But did they
understand how that data could be used? Little did they know that they
were giving access to their friends' data behind the scenes, through a
setting buried deep in Facebook's privacy settings. Another example:
when Twitter users tweet, they know their tweet is publicly accessible,
but many are not aware that their tweets can be stored and used
for research, or even for monitoring them. Ethical data science is not
just about getting consent, but also about informing users what
they're consenting to. That is clarity.
● Consent
You can’t establish trust between the people providing data and the
people using it without agreeing on what data is collected and how
that data will be used. Agreement starts with obtaining consent to
collect and use data. Unfortunately, the agreements between a user and
a data service usually lack clarity and are binary, i.e. you either
accept the whole thing or decline everything. In business, when negotiations
on an agreement between two parties begin, there are several redlines
before the agreement is finalized.
Unlike in business, when a user is agreeing to a contract with a data
service, they either accept the terms or they don’t get access. It is not
negotiable. For example, while checking into a hospital for a
treatment, you need to sign a form which gives them the right to use
your data. There’s no way you can figure out for which purposes your
data can be used. And what happens to your data when one company
buys the company that you originally provided your data to? It is not
uncommon that data is collected, used, and sold without consent.
Google collected data from cameras mounted on cars to develop its
mapping products. Samsung collected voice recordings from TVs that
respond to voice commands. Cable set top boxes can collect data
about their users, and once data has escaped, there is no recourse. You
can’t take it back. Even if an organization is ready to delete the data,
it’s very difficult to prove that it has been deleted.
● Consistency and Trust
Trust requires consistency over time. You do not trust someone who is
unpredictable, or someone whose actions do not match their words. A
long period of consistent behavior is required to restore trust.
Consistency, and therefore trust, can be broken either directly or
indirectly. Intentionally or unintentionally, an organization can expose
its users' data. We have seen such incidents with Yahoo, where
customer data was stolen in past years. Local hospitals, government
databases and data brokers have also been added to the list, making it
longer each day. Failing to safeguard user data breaks consumers'
trust. Frustration, anger, and surprise are seen in users when
they don't know what they have agreed to. Facebook initially claimed
that Cambridge Analytica's use of Facebook's data to target vulnerable
customers with highly specific ads was not a data breach. And while
Facebook was technically correct, the public perceived it as a breach
of trust, if not a breach of Facebook's perimeter. People failed to
understand their user agreements, the complex privacy settings, and also
how Facebook would interpret those settings.
● Control and Transparency
Would you like to know how your data is being used? Would you like
to control how services use your data? For example, Facebook
asks for your political views, religious views, and gender preference.
Although providing that information is optional, you fill in the
details. Suppose that after a few years you change your views and
would like to delete that data. Are you sure that the data you deleted
from your profile is deleted everywhere, or is it still stored
somewhere? Were you aware that your data is used by sites like Facebook
and Instagram to provide relevant ads or similar posts? Have you come
across apps that ask you for specific permissions that are not related
to their functioning, but that do not function properly when these
permissions are not granted? Oftentimes, users are given all-or-nothing
choices, or a convoluted set of options, making controlling access
overwhelming and confusing. It's often difficult, if not impossible,
to reduce the amount of data collected or to have data deleted later.
● Consequences
As data products increase in sophistication and have wider societal
implications, it is necessary to raise awareness about whether the data
that is being collected has the potential to cause harm to an individual
or a group. Risks can never be eliminated completely. Due to potential
issues around data usage, laws and policies have been put in place to
protect specific groups: for example, the Children's Online Privacy
Protection Act (COPPA) protects children and their data.
Unfortunately, these laws have not been updated. Even altruistic
approaches can lead to unintended and harmful consequences. When
AOL released anonymized search data to researchers a few years ago,
it was possible to "de-anonymize" the data and identify specific users.
Pokemon Go, a game that took the market by storm, had access to your
location and provided real-time animated maps similar to the actual
map of the location. Had the data been breached, it would have been
easy to gain access to almost the whole interior of a particular
country. While Pokemon Go and AOL created a series of unexpected
consequences by releasing their data, it's important to understand that
such data is likely to be dangerous even if it is not made public.
Aggregation of data sets is far more powerful, and more dangerous,
than anything you get from a single data set. For example, if
running-route data were combined with data from smart locks, it could
tell thieves at what time, and for how long, a house or apartment was
unoccupied. An attacker could steal this data and the company
wouldn't even recognize the damage. Well-intentioned data scientists
wanted to help others; the problem is that they did not consider the
consequences and the potential risks if the data fell into the wrong
hands. Other projects, like LinkedIn's Economic Graph and Google
Books' Ngram Viewer, have successfully opened up data for the public
benefit. Many data sets that could offer huge benefits remain locked
up on servers. Traffic-related data from ride-sharing and
GPS/mapping companies could transform approaches to traffic safety
and congestion. But careful planning is required before opening up
that data to researchers.
2.4 IMPLEMENTING THE FIVE Cs
Our lives can go from mundane to amazing through the proper use of
data. Movie recommendations that match your taste accurately aren't a
bad thing; and if we could collect medical data from patients around
the world, significant progress could be made in treating diseases like
cancer. But while getting good movie recommendations or better
medical treatments, we want to ensure that the five Cs are
implemented. We want to make sure that the data collected from us for
these services is safeguarded.

Over the past decade, the software industry has made significant
efforts to improve the user experience, and all this work has produced
results: overall, using software has become easier and more enjoyable.
"Growth hacking" focuses on getting people to sign up for services
through viral mechanisms. We have seen few product teams that strive
to develop a user experience that balances the instant experience with
long-term values. In short, the impacts of the five Cs have not been
considered by most product teams. For example, how will an application
inform users about how their data will be used, and get their consent?
This part of the user experience cannot be ignored, and it cannot trick
users into giving their consent. It is all part of the total user
experience. Users need to understand what they are consenting to and
what the consequences of that consent are; without that, the designer's
job is not complete.

The five Cs are not only the responsibility of designers but of the
entire team. The same applies to product managers, business leaders,
sales, marketing, and executives. The five Cs should be part of every
organization's culture. Product and design reviews should go over the
five Cs regularly. The team should consider developing a checklist
before releasing a product to the public. The same applies to
well-established products: new techniques could have been developed
that can harm users unintentionally. In short, it is about taking
responsibility for the products that are built and that are already in
the market. The five Cs are a mechanism to ensure that those products
do not harm us.
Check your progress
Fill in the blanks
1 _____________ _______________ takes data and helps a
company in making a decision.
2 Failure to handle data ethically can have _________________
___________________on people.
3 _____________________ __ ________ ________ problem
usually arises in areas where various computer models are used
in making predictions.
4 The most difficult thing about ethics is maintaining the junction
between _______________ and _____________.
5 The tools to ______ ________ are presented to the people by
the internet.
6 The agreement between a user of a service and the service itself
is ______.
MCQ (one correct answer)
1 Data science can be broken down into how many components?
a. Three
b. Five
c. Two
d. Four
2 How does Identity theft affect victims?
a. Causes real nuisance for victims.
b. It does not reach catastrophic or life-changing levels.
c. Ruins a person’s well-being and completely disrupts
their life.
d. None of these.
3 We are witnessing increasing automation in:
a. Data cleaning, data preparation
b. Other data-science drudgery
c. Neither a nor b
d. Both a and b
4 Which tools to wreak havoc are presented by the internet?
a. Video and image synthesis only
b. Voice synthesis
c. Neither a nor b
d. Both a and b
True or False
1 Consistency, and trust, can be broken either intentionally or
unintentionally.
2 Lack of trust is confined to a single platform.
3 Clarity and consent are totally different terms with no
connection between them
4 When a user is agreeing to a contract with a data service, the
contract is non negotiable.
5 Due to potential issues around data usage, laws and policies
have been put in place to protect specific groups and have been
updated at regular intervals.
Activity
Take one example from your surroundings that raises ethical concerns
and propose a solution to address it.
Summary
Data science can be broken down into three components:
1 Business intelligence
2 Decision science
3 Machine learning
Undoubtedly, the future will be driven by machine learning, and data
science forms the epicenter of that future. Failure
to handle data ethically can have serious repercussions on people and
can lead to a loss of trust in products, projects or organizations. In
simpler words, data science is about testing, machine learning for
decision making, and data products in the tech industry. We are also
witnessing increasing automation in data cleaning, data preparation
and other data-science drudgery. Data ethics currently rests largely in
the hands of data scientists, and that is unlikely to change any time
soon; yet in the ethical debate, everyone associated with collecting and
controlling data should have a voice in how data is used.
Challenges and categorizing them
Ethical challenges arise when opinions differ on what is considered
right and wrong. The major ethical challenges related to data and data
science concern the reinforcement of human biases, lack of
transparency, consent and power, privacy, and unfair discrimination.
Users of Algorithms
Another way to categorize ethical issues in data science is by the
people who use these methods.
● Business: The public receives valuable services in return for the
use of their data. In addition, they do have options for
configuring what they actually share.
● Government: Comprehensive and detailed individual data
collection enables officials in some non-democratic societies to
have uncontrolled and destructive control over individuals and
communities.
● Personal data: While credit card data theft is a real nuisance for
victims, identity theft can ruin a person’s well-being and
completely disrupt their life.
Categories to Consider
● Behavior change and manipulation
A variety of AI and machine learning techniques can be used to
change and manipulate individual behavior. Social media
collaborative-filtering algorithms sustain useful business and social
connections, and recommender systems help you find products of
potential interest. Yet for almost every positive and constructive
application of these algorithms, it is also possible to point to a
catastrophic and dangerous one.
● Deception – surface and deep.
The desire to create havoc lies dormant in many people, and the
internet hands them the tools to wreak it. Technology can now
create video and voice-synthesis tools that mimic real people so
closely that the results are indistinguishable from the genuine
article.
The Five Cs
A good data product or service requires not only a useful offering that
is commercially viable, but also one that uses user data ethically and
responsibly. We often focus more on a product's technology or its user
experience than on how to build the data product responsibly, with the
focus on the user. There are five framing guidelines that help us think
while building data products. These are called the five Cs:
1 Clarity
Clarity is entwined with consent. Users need to have a clear
idea of what data they are providing, what will be done with
the data, and what the downstream consequences of using their
data will be. Ethical data science is not just about getting
consent, but about informing users of what they are consenting
to. That is clarity.
2 Consent
You can’t establish trust between the people providing data and
the people using it without agreeing on what data is collected
and how that data will be used. Unlike a business contract, when
a user agrees to the terms of a data service, they either accept
those terms or they don’t get access; it is not negotiable. And
once data has escaped, there is no recourse. You can’t take it back.
3 Consistency and Trust
Trust requires consistency over time. You do not trust someone
who is unpredictable or whose actions do not match their words,
and a long period of consistent behavior is required to restore
trust once it is lost. Consistency, and therefore trust, can be
broken either directly or indirectly.
4 Control and Transparency
Would you like to know how your data is being used? Would
you like to control how services use your data? Too often, users
are given all-or-nothing choices, or a convoluted set of options
that make controlling access overwhelming and confusing. It is
often difficult, if not impossible, to reduce the amount of data
collected or to have data deleted later.
5 Consequences
As data products increase in sophistication, and have wider
societal implications, it is necessary to raise awareness about
whether the data that is being collected has potential to cause
harm to an individual or a group. Risks can never be eliminated
completely. Aggregating data sets is often far more powerful and
dangerous than anything you can get from a single data set.
Implementing the Five Cs
Most product teams have not considered the impacts of the five Cs.
Users need to understand what they are consenting to and what are the
consequences of that consent; without it the designer's job is not
complete. The five Cs are not only the responsibility of designers but
also of the entire team. In short, it is about taking responsibility for the
products that are built and that are already in the market. The five Cs
are a mechanism that ensure the products do not harm us.
Keywords
Machine Learning: The ability of a machine to learn from past
data and make predictions about future data.
Reinforcement: Strengthen and/or support
Decision Science: It is the collection of quantitative techniques
used to inform decision-making at the individual and population
levels.
Self Assessment Questions
1. Describe in short each of the five Cs.
2. What are the ways of implementing these five Cs?
Answers to Check your Progress
Fill in the blanks
1 Decision science takes data and helps a company in making a
decision.
2 Failure to handle data ethically can have serious repercussions
on people.
3 Reinforcement of Human Biases problem usually arises in areas
where various computer models are used in making predictions.
4 The most difficult thing about ethics is maintaining the junction
between ethical ideas and practice.
5 The tools to wreak havoc are presented to the people by the
internet.
6 The agreement between a user of a service and the service itself
is binary.
MCQ (one correct answer)
1 a
2 c
3 d
4 d
True or False
1 True
2 False
3 False
4 True
5 False
References
1. https://hbr.org/2018/08/what-data-scientists-really-do-
according-to-35-data-scientists
2. Ethics and Data Science by Mike Loukides, Hilary Mason and
DJ Patil
History, Concept of Informed Consent
UNIT
3
Structure:
3.1 Introduction
3.2 Informed consent in data science
3.3 Limitations and challenges
3.4 Three key lessons
3.5 Conclusion
Summary
Keywords
Self Assessment Questions
Answers to Check your Progress
References
3.1 INTRODUCTION
Respect for individuals is a fundamental value for any concept of
research ethics - although how to earn respect in practice is an
ongoing question. In the late 19th and 20th centuries, “informed
consent” emerged as a specific way to increase respect in the context
of medical, behavioral, and social scientific research. Today, these two
concepts - respect for individuals and informed consent - have become
practically inseparable. At the same time, they have been challenged,
implicitly if not explicitly, by research activities enabled by
increasingly advanced networked information and communication
technology (ICT) and by a wide range of socially and personally
identifiable data. The rise of “Big Data” has raised new and
compounding questions about the ethical treatment of human subjects
in research. As a result, many researchers and private companies have
avoided or ignored basic research-ethics ideas such as “informed
consent,” relying instead on notice-and-consent methods that generally
do not have the same grounding in research ethics or even in ethical
theory. This often results in vague or impenetrable privacy policies
and data-usage notices. Informed consent is not a fixed or rigid
concept, but one that is constantly evolving, and the specific
mechanisms and methods used to regulate it have always been fluid.
Surveying the landscape of past challenges posed by earlier forms of
research, as well as the new challenges posed by the rise of industry
research and data-intensive ICTs, shows how the history of informed
consent can help shed light on relevant developments and debates for
thinking about data and research ethics today.
The National Commission met in 1976, and in 1979 the Belmont
Report was published. The three major ethical principles of the report,
respect for persons, beneficence, and justice, are considered
fundamental for ethics in human-subjects research. With regard to
respect for persons, the report states that “individuals should be
considered autonomous agents” and that research should be conducted
on subjects only “voluntarily and with sufficient information”. The
report provides specific guidance on how to apply the principle of
respect for persons through the consent process. Focusing more on the
information component than on consent itself, “adequate information”
refers to the research procedure, its objectives, risks and expected
benefits, and a statement welcoming questions and permitting
withdrawal from the research at any time. Further, the report addresses
aspects of how information is presented that can adversely affect a
subject’s ability to make a well-informed decision. Lastly, the report
states that in some cases it may be necessary to conduct written or oral
tests in order to ensure that the subjects are well informed. It is worth
noting that the report acknowledges the importance of voluntariness
and makes it clear that consent is valid only if it has not been coerced
or unduly influenced.
The Common Rule states that “information on the subject must be in a
language that the subject or representative can understand, and that
should not disregard the legal rights of the subject or the researcher.”
Efforts are underway to amend the Common Rule to accommodate
non-medical research, and especially data-science research. However,
it should be noted that there is a known loophole whereby protocols
can be split up so that academic researchers not involved in data
collection can avoid IRB review at their home institutions, while
industry partners conduct the data collection on human subjects freely.
Facebook used this loophole for its emotional contagion study.
Informed consent must also be documented through a written consent
form approved by the organization’s review board, and this form may
be read and signed, or presented orally provided a witness is present.
3.2 INFORMED CONSENT IN DATA SCIENCE
The concept of informed consent was developed in the context of
experiments that would be conducted on human subjects, and data
would be collected prospectively after a consent had been obtained. It
is based on the principle that the party that might potentially be
harmed (the human subject) gets to decide whether the benefit to
society, together with whatever reward, money, or other compensation
the subject receives, is worth the risk they are being put at.
In 2012, researchers at Facebook and Cornell University conducted an
experiment where the news feed of selected Facebook users was
manipulated. The users of Facebook did not know that their newsfeed
was being manipulated. They did know that Facebook regularly
tinkers with the algorithm that chooses the items shown in their
newsfeed, and they did know that Facebook suggested things to them
in their news feed based on what it thought was good for them. What
they did not know was that what Facebook claimed was good for them
to see was potentially manipulated for the purposes of this research
study. Once the results of this experiment became public, Facebook
got a lot of bad press. Companies should not lie to us on purpose,
even if the effects of the lie are no different from what they would
have been had an ethical process been used. OkCupid, a matchmaking
company, also conducted experiments in which it reported
compatibility scores different from the ones its algorithm had actually
estimated, and it genuinely felt it was doing the right thing.
Let’s take a look at how this translates into our world today and what
data scientists do. First, the experiments are often not designed as
experiments up front: the data is collected first, and the experiment
comes afterwards. Data collection is usually done by whoever the
individual is interacting with, most likely a merchant, software vendor,
and so on. The information provided in this case is usually hidden in
multiple pages of fine print. You want to use some service and you are
given many pages of fine print that you have to agree to in order to use
that service. There is no negotiation. Anyone who obtains this kind of
consent has some advantage in law, but it is far from clear, explicit
permission. So there is a long history of legal precedent that one can
look at and draw from to understand the benefits of obtaining consent
for data collection, and to understand what being informed really
means. And from an ethical point of view, aside from the law, we can
all agree that there is something odd in claiming that someone gave
informed consent when they were handed so many pages of fine print
that they never really had a chance to read them. The voluntariness of
such consent is also a bit dubious.
This is because consent is obtained at the moment a user intends to
perform a specific action, such as using a software service or
purchasing a product. It is not something they have the time and
opportunity to think about, or something shown early enough in their
shopping experience to influence their choice of vendor, product, or
service. Instead, the decision comes too late, after they have already
decided what they really want to do. All of a sudden, having decided
to take a certain route, they are asked to pay a toll or else they cannot
take that route. To see how informed consent works in practice,
consider the example of Facebook and a research experiment it might
conduct, say a psychology study.
user that it may collect user data for research purposes. Certainly, it
has been doing so since 2012 after its famous mood contagion
experiment. And so Facebook may have met the letter of its user
agreement, but it still got hammered by its users. And the point here is
not that there's something wrong with Facebook. Facebook does a
great job of asking questions about privacy and what users want to do
with their data. It's just that Facebook is leading the curve. It's a big
company and has collected a lot of personal data in it. And often gets
into the crosshairs of the community before anybody else, because
they're there before anybody else. That is informed consent with
regard to data collection. But then you should have a question about
the data you provide: what is it actually going to be used for? You
may give data about yourself to a merchant to obtain a specific
service, and you do not want the business to use this data for other
purposes. For example, you may not want them to use it to sell you
other things related to your recent purchase; you may want them to
use it only for the specific service you have contracted with them for.
And you do not want the merchant to share this data with other
parties. So you may, for instance, give consent that disallows
repurposing. You are saying: you may collect this data from me, but
permission has been given in a particular context only; that context is
important, and you may not use the data in any other context or sell it
to somebody else. Now, repurposing data is not a bad thing in itself;
businesses often need to do it. Your credit card company definitely
needs to collect data about your purchases and your payments. It does
not strictly need to share this data with a credit reporting agency, and
you may not particularly like it doing so, but that is something you
have to accept as part of the social setup: if you are to be given credit,
you have to participate in an environment with credit reporting.
Reuse, even where no business requires it, can also have real social
value. You share your medical data with a hospital to get better
medical care, and you may not mind, indeed you may be happy, that
records of your illness and recovery could help future generations
suffer less. The difficulty is that the specific research questions will be
asked later, and at the time of collection nobody knows which
information scientists will eventually want to study. The questions
come later. The data was already
compiled and this is called retrospective data analysis as opposed to
prospective data collection. The problem here is that without having
been given enough information, how will the human subjects know
what they are consenting to? Yet consent cannot realistically require a
detailed understanding of every potential research question that one
might later want to ask. So it is a balancing act: obtaining informed
consent with enough information for that consent to be meaningful.
In conclusion, most of the interesting data, most of the data that we
can analyze, is man-made, is about humans, or has an impact on
humans. When we practice data science we should keep this in mind,
and this is why the ethical study of data science matters.
For this reason, Institutional Review Boards (IRBs) have been set up,
because it is difficult for researchers alone to weigh all the advantages
and disadvantages for human subjects. The IRB reviews the proposed
study of human subjects, weighs the harm against the benefit, and
ensures that informed-consent principles are properly followed. The
board has a variety of members, including non-scientists: it includes
scientists who can judge the scientific value of the work, while
non-scientists represent society in a broader sense. The Institutional
Review Board has to approve the study before it proceeds.
3.3 LIMITATIONS AND CHALLENGES
For many qualitative researchers, “standardized and formal methods
are not sufficiently compatible in the social context in which
qualitative research is now conducted.” For example, the practice of
fully disclosing research risks and obtaining clear, written consent at
the very beginning of a project fits poorly with researchers developing
relationships at field sites during long-term projects. In such projects,
consent obtained at the outset may be stretched or broken, intentionally
or unintentionally, and must be renegotiated from time to time. In
these cases, the researcher’s ongoing sense of responsibility toward his
or her research subjects matters more than one-time consent at the
beginning of the study. Furthermore, simply “adjusting” to satisfy the
ethics review board and to meet administrative requirements, given
their impracticality in the field, puts researchers in a position of
potential duplicity toward their IRB. In the 1990s and 2000s, subdomains of
Internet research ethics began to address consent issues in digital
environments, where “lurking” (i.e. observing participants’ behavior
without one’s presence being recognized) is possible and
public/private distinctions cannot be drawn easily. Even when
participants can be identified and contacted, the lack of
researcher-subject interaction makes it more difficult to assess
participants’ understanding or to evaluate potential risks effectively.
Although some recommendations suggest adding “click to accept”
buttons to digital consent forms to help communicate information, this
only highlights the limitations of consent forms in the context of new,
online research. Further, there are additional logistical challenges
associated with obtaining consent for research conducted online,
especially when the subject pool is very large and/or when a dataset is
publicly accessible, making it more difficult to trace individuals in the
data. What is embodied in this model is an idea of research
participants as rational, individual, modernist, and Western - subjects
for whom consent signalled by a signed form and its documentation is
taken to be unproblematic.
New challenges and changing research contexts create tension for the
people who work within or are governed by this system. The study of
modern information systems has forced regulators and ethicists to
reconsider human-subjects issues in computational, statistical, and
data-science research. There are growing opportunities in “big data”
research to uncover new knowledge from behavioral traces left on
online environments and platforms, or to make more personal (and
perhaps more accurate) inferences. Research on human subjects has
also moved beyond the academy, as private and technology companies
invest more and more in internal research to develop products, engage
users, and develop in-depth knowledge of human behavior.
At the social media company Facebook, for example, academic-style
research is an important part of the company’s culture, including an
active data science team with broad connections to academia. From
the microblogging site Twitter to the dating site OkCupid to the
transport network Uber, other companies readily publicize and share
details about their research. In addition, with the Internet of Things
(IoT), data collected in one context can become part of revealing new
insights, supporting multiple use cases, and connecting to many other
products and services within the same company.
Likewise, future acquisitions and the consequent merging of user data
suggest how separate information streams may be combined in the
future. The growing number of interfaces, data sources, and
opportunities to interact with various aspects of individuals’ lives
provides an unprecedented ability to study our public and private
lives. These challenges have been particularly evident in controversial
online experiments that have led to widespread conversations and
concerns, particularly Facebook’s emotional contagion experiment and
OkCupid’s matching experiments. In particular, there has been a great
deal of discussion in both the public and academic spheres around the
ethics of experiments designed to manipulate emotions without
consent (the Facebook study) or without post-experiment debriefing.
This tension partly reflects the historically commercial and
instrumental motivations of research done in industry, where A/B
testing, usability studies, and other methods are commonly used to
improve products and services for the benefit of users and of profit
margins.
Current investigations into how users engage with products and use
technology, often planned and conducted by highly trained staff or
partner organizations, blur this historically distinct line. Big data
methods target not only how users interact with a particular product
but also general insights about human behavior, whether through
observational studies or through experimentally tested hypotheses,
marking a distinct shift in the purpose and scope of internally
organized research. But private industry’s pursuit of questions about
human behavior in general presents new questions about how to apply
the ethics of research on human subjects in this environment. Of
particular interest are the new challenges of protecting individuals and
groups, as large-scale, industry-driven behavioral research and
development risks compromising privacy and the security of user
data.
Currently public and private organizations are left without strict
guidance on how to responsibly implement research and development
using potentially sensitive datasets. New discussions in the industry
have highlighted some attempts at self-regulation through internal
review processes, such as Facebook’s recently developed research
review, which layers research oversight onto existing organizational
infrastructure and offers several opportunities to flag specific projects
for review. Such processes, however, are only just beginning to
emerge.
Until now, however, the lack of a clear or unambiguous research code
of conduct for data science and industry research has created a policy
vacuum that is being filled by the legalistic regime of privacy policies,
which bears little resemblance to research ethics. Unfortunately,
critics and scholars alike conflate “informed consent” with
notice-and-consent processes, and many legal discussions focus on
improving the readability of privacy policies and notices so that they
can stand in for informed consent in research.
Research ethics is not only about orienting researchers toward research
and data subjects in specific ways, but also about explaining and
applying ethical ideals in a non-ideal world. The basic ethical
principles introduced by the Belmont Commission, especially respect
for persons, emerged from strong philosophical debate within the
Commission and were then carried into regulation. As a result, the
regulations that govern research settings actively weave together
ethical principles and existing legal traditions. In contrast, online
privacy policies were developed pragmatically in the late 1990s as a
way for private companies to stave off further regulation. As a result,
the notice-and-consent processes attached to privacy policies, whether
data is collected online or offline, are driven by legal and regulatory
precedent and tend to legitimate almost any collection, use, or
disclosure once consent is obtained, rather than enabling individuals
to manage their own privacy meaningfully. Furthermore, research
ethics rules are meant to be understood by researchers, students,
ethicists, lawyers, and even the public; privacy policies, by contrast,
are typically “written by lawyers for lawyers”. By consistently
responding to ethical violations and evolving with the scientific,
technical, and administrative contexts that shape research and practice,
traditional medical and behavioral research ethics have demonstrated a
commitment to important human values such as respect and
beneficence.
3.4 THREE KEY LESSONS
Based on the history and challenges presented here, we summarize the
three main lessons that should inform future work and discussion
about online respect and consent, and for data science today.
1. The ideas and legal practices surrounding informed consent are
often conflated, but it is important to note that the concept and
value of informed consent precede our current understanding of it
as an administrative or bureaucratic process. At various stages of
its early history, informed consent often served as an informal
way to promote or respect the autonomy of individual subjects.
Often, emphasis was placed on (a) the voluntariness of the subject
and (b) a mutual, often informal, agreement reached between the
subject and the researcher. While one may rightly question the
power dynamics of those early situations, the point is that there
was no strict requirement that consent take the form of a
bureaucratic, administrative process. Nor is consent, as a value,
limited to a signed agreement: consent can be obtained by verbal
exchange or other methods of interaction. Formal documentation
and routine disclosure methods did not become common until the
latter half of this roughly hundred-year history. At that time,
certain policy actors were instrumental in laying the groundwork
for the idea that research ethics is about administrative
requirements and organizational accountability as much as about
enacting respect for persons. To sum it up: the size and scope of
today’s informed consent were not an accident, but were shaped
by specific political decisions that began in the 1950s. Written
documentation with ever-evolving boilerplate language is not the
only way to obtain informed consent; rather, it became a scalable
strategy for nationally regulated research, designed to address the
particular ethical dilemmas of that time.
2. We should recognize the strong history of pushback and ethical
discussion among social scientists and humanities researchers.
Although the tensions between institutional ethics review and this
type of research have largely been resolved, there is a
decades-long history of discussion and debate that has helped
inform and further develop ethical identities and commitments in
specific academic disciplines, a history that “big data” research in
computational, social-scientific, and technology-industry settings
cannot leave behind. Informed consent, in particular, has a very
long history, stretching back more than fifty years, with its
earliest manifestations in medical treatment and in discussions
about personal autonomy. While this may look on the surface like
historical nitpicking, the deeper point is that researchers in
mathematics, computer science, and data science do not have an
equally strong history of research-ethics debate to which they can
appeal. Consequently, appeals to informed consent that disregard
this history are, at best, hollow and, at worst, self-interested.
3. The political history of informed consent reveals how particular
norms came to protect the values we seek, such as respect for
persons. For example, the initial focus on the voluntariness of
consent reflects a prevailing concern with limiting the state’s
power of coercion. Under those conditions, voluntariness (and
especially the absence of coercion) was central to honoring
people’s dignity. As the emphasis shifted from voluntariness to
informed choice, informed consent in administrative and legal
contexts came to depend on the disclosure of certain types of
information and the assurance of certain levels of understanding.
The growing centrality of “information” to consent is reflected in
its development at the NIH and in the use of required disclosures
and administrative methods for purposes of accountability and
responsibility. The requirement to disclose specific information
(e.g., risks and benefits, alternative options, an explicit ability to
opt out) likewise rests on the judgment that an individual cannot
make a truly autonomous decision if such material information is
not available. Abuses within academia are still a matter of great
concern, but discussion of individuals in the context of data
platforms and the digital industry needs to consider seriously what
it means to protect individuals and groups from the growing,
pervasive influence of online platforms and technology
companies. We need to ask tough questions about what it means
to exercise a “free power of choice” in an online ecosystem in
which a few large Internet companies control the flow of
information among users, companies, and third parties such as
advertisers. While the solution to these emerging challenges may
look different from traditional informed consent, the
impracticability of consent when dealing with thousands or even
millions of individuals represented as data points should not lead
us to dismiss those concerns, or to ignore the basic values (such
as autonomy and respect for persons) that consent was constituted
to support in the first place. As Mary Gray notes, “ethical
dilemmas are often signs that our methodological approach is
stretched too thin and is failing us”. Not only do we need to pay
close attention to social, legal, and political history, but the
assumptions underlying our moral commitments must be
constantly re-examined in the light of new and emerging social
and technological possibilities. In addition, we need to think
carefully about what it means to enact respect in particular social,
political, or technical contexts. This contextual and political
complexity is important for how we think about implementing
research ethics. We need to pay attention to the fact that research
settings or certain features of ICT may “physically affect the
opportunities and limitations of individuals, thus excluding or
limiting other interests”.
3.5 CONCLUSION
As the preceding sections have shown, informed consent is not a
static or fixed mechanism; rather, it has evolved from early
human-subjects research practices through codified documents, such
as the Belmont Report (and beyond), that have set ethical standards in
national research policy. Today we have to try to retrieve from that
history insights valuable for discussions of data, technology, and
research. In particular, we have focused on the contingent nature of
informed-consent policies and the way they developed within the
social and political contexts in which they operate. This is especially
important for thinking about how to realize respect in the context of
online and industry research. However, we should not overstate the
importance of informed consent, because it is not the only mechanism
for operationalizing respect.
We acknowledge that ethical conversations can be more manageable
if the focus is on consent, but focusing too much on it risks losing the
full power of respect as a guiding value for research policy. As
LeBacqz summarizes: “Just as respect for individuals is diminished
toward autonomy and autonomy to self-determination or freedom of
choice, the logical conclusion is that the broader principle of
‘respecting individuals’ is then broken by the rule of ‘consent to
information’.” In light of the history recounted above, the lesson is
that a legalistic notion of notice and consent is not the only way to
realize important values such as autonomy or respect. Instead of
dismissing consent as unworkable or inapplicable to industry research,
we argue that this history remains relevant for thinking about how to
enact respect in the context of online experiments and the kind of
research done by Internet companies like OkCupid and Facebook.
Connecting careful historical analysis with careful theoretical
explanation, and seriously questioning both, is an integral step towards
developing 21st-century research ethics and policy recommendations.
Check Your Progress
Fill in the blanks
1. _________________________________ is a fundamental value
for any concept of research ethics.
2. ________________________ is often done by people who are
interacting with someone like a merchant, software vendor etc.
3. A __________________________ uses existing data that have
been recorded for reasons other than research
True or False
1. Informed consent is a fixed and rigid concept.
2. Belmont report states that the individuals should be considered
as autonomous agents.
3. If you plan to conduct a human subjects experiment, and you
get informed consent of the subjects, you can proceed.
Multiple Choice Questions
1. The CBI asks search engines like Google/Yahoo to inform it
about all individuals who search for information on how to
make a bomb. Is it ethical for Google to turn over this
information:
a. Always
b. After informing users in a general way that some
searches may be revealed to others as required by law.
c. After a specific warning to users about reporting to law
enforcement any searches viewed as threatening.
2. You have an idea to improve the way in which patient data is
input to electronic medical records, thereby reducing errors and
better integrating data entry with patient care workflow. What
kind of data do you use while running the experiment?
a. Prospective Data
b. Retrospective Data
3. A pharmaceutical start-up company seeks to buy information
from Google on searches related to particular biological pathways.
Is it ethical for Google to sell this information:
a. Always
b. After informing users in a general way that some
searches may be revealed to others.
c. After informing users that their search strings become the
property of Google, and can be sold commercially.
Activity
A chocolate company Z has learned about Facebook’s mood
manipulation experiment. Therefore, it has designed its web site to tell
heart-warming stories in callout boxes on every page. These stories, at
best, are tangentially related to the products being sold on the page.
They A/B test this web-site before launch to see if the story boxes do
have the intended effect. They find that the boxes do have the desired
effect of increasing sales. They then adopt the new website design.
Does Company Z need to inform its customers about this effort? To
what extent? Does it need to obtain consent? If so, for what? If you
answered YES to the consent question above, what is the smallest
change to the scenario described above that would make you change
your answer to NO?
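As a purely illustrative aside, and not part of the scenario above, the
sketch below shows one standard way Company Z might check whether
the story boxes actually changed purchase rates in its A/B test. The
visitor and purchase counts are made up, and a two-proportion z-test is
only one of several reasonable analysis choices.

from math import sqrt
from scipy.stats import norm

# Hypothetical results: purchases out of visitors for each variant (made-up numbers).
visitors_a, purchases_a = 10_000, 320   # A: original site
visitors_b, purchases_b = 10_000, 368   # B: site with story boxes

p_a = purchases_a / visitors_a
p_b = purchases_b / visitors_b
p_pool = (purchases_a + purchases_b) / (visitors_a + visitors_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / visitors_a + 1 / visitors_b))
z = (p_b - p_a) / se
p_value = 2 * (1 - norm.cdf(abs(z)))    # two-sided two-proportion z-test

print(f"A: {p_a:.2%}  B: {p_b:.2%}  z = {z:.2f}  p = {p_value:.3f}")

A statistically significant lift would support adopting the new design,
but, as the questions above suggest, it says nothing about whether
running the manipulation without telling customers was ethical.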
Summary
Introduction
In the late 19th and 20th centuries, “informed consent” emerged
as a specific way to increase respect in the context of medical,
behavioral, and social scientific research. Details of the landscape of
past and future challenges posed by forms of research as the new
challenges posed by the rise of industry research and data-intensive
ICTs, shows how the history of informed consent can help shed light
on relevant developments and debates for thinking about data and
research ethics today. The three major ethical principles of the report,
respect for persons, beneficence, and justice; are considered
fundamental for ethics in human-subjects research. With regard to
respect for persons, the report states that “individuals should be considered
autonomous agents” and that research should be conducted on those
subjects “voluntarily and with sufficient information”. The report
provides specific guidance on how to use the principle of respect for
individuals through the consent process. The Common rule states that
“information on the subject must be in a language that the subject or
representative can understand, and that should not disregard the legal
rights of the subject or the researcher.”
Informed consent in data science
The concept of informed consent was developed in the context of
experiments that would be conducted on human subjects, and data
would be collected prospectively after a consent had been obtained.
there is a long history that one can look at in terms of legal precedent
and extract from it to understand the benefits of obtaining consent for
data collection. And from an ethical point of view, aside from the
law, we can all agree that there is something odd in claiming that
someone gave informed consent when they were given so many
pages of fine print that they did not really have a chance to read. But
then you should have a question about the data you provide, what is it
actually going to be used for? And you do not want the merchant to
share these data with other users. the data that has been collected is
something where permission in a particular context only has been
given. That context is important and you may not use it in any other
context or sell it to somebody else. Yet consent cannot realistically
require a detailed understanding of every potential research question
that might later be asked, so it is a balancing act: obtaining informed
consent with enough information for that consent to be meaningful.
Limitations and challenges
Research on statistics on human subjects has moved out of the
academy as private companies and technology companies invest more
and more in internal research to develop products and engage users
and develop in-depth knowledge of human behavior. This is mainly a
result of the historically commercial and instrumental motivations of
research done in industry, where A/B testing, usability studies, and
other methods are commonly used to improve products and services
for the benefit of users and of profit margins. But private industry’s
pursuit of questions about human behavior in general presents new
questions about how to apply the ethics of research on human
subjects in this
environment. Of particular interest are the new challenges to the
protection of individuals and groups, as large-scale industry-driven
behavior-research and development risks compromising privacy and
compromising the security of user data. Until now, however, the lack
of clear or unambiguous research code of conduct for data science and
industry research has created a policy vacuum that is filled by the
legal system of privacy policies.
Three key lessons
The three main lessons that should inform future work and discussion
about online respect and consent, and for data science today.
1. The ideas and legal practices surrounding informed consent are
often conflated, but it is important to note that the concept and
value of informed consent precede our current understanding of
it as an administrative or bureaucratic process. While one may
rightly question the power dynamics of those early situations,
the point is that there was no strict requirement that consent
take the form of a bureaucratic, administrative process. To sum
it up: the size and scope of today's informed consent were not
an accident, but were shaped by specific political decisions that
began in the 1950s.
2. We should recognize the strong history of pushback and ethical
discussion among social scientists and humanities researchers.
There is a history of decades of discussion and debate that has
helped inform and further develop ethical identities and
commitments in specific academic disciplines. While this may
look on the surface like historical nitpicking, the deeper point is
that researchers in mathematics, computer science, and data
science do not have an equally strong history of research-ethics
debate to which they can appeal. Consequently, appeals to
informed consent that disregard this history are, at best, hollow
and, at worst, self-interested.
3. The political history of informed consent reveals how particular
norms came to protect the values we seek, such as respect for
individuals. The growth of
“information” as central to consent is reflected in its
development in the NIH and their use of the necessary
information and administrative methods for the purpose of
accountability and responsibility. This context and political
complexity is important for how we think about the
implementation of matters of research ethics. We need to pay
attention to the fact that research settings or certain features of
ICT may “physically affect the opportunities and limitations of
individuals, thus excluding or limiting other interests”.
Conclusion
We should not overstate the importance of informed consent, because
it is not the only mechanism for operationalizing respect. As LeBacqz
summarizes: “Just as
respect for individuals is diminished toward autonomy and autonomy
to self-determination or freedom of choice, the logical conclusion is
that the broader principle of respecting individuals' is then broken by
the rule of 'consent to information'.” Instead of dismissing consent as
unsupported or not applicable to industry research, we argue that there
is a relevance to the history of thinking about how to gain respect in
the context of online experiments and the kind of research done by
Internet companies like OKCupid and Facebook.
Keywords
IRB: The Institutional Review Board (IRB) is an administrative
body established to protect the rights and welfare of human
research subjects recruited to participate in research activities
conducted under the auspices of the institution with which it is
affiliated.
Microblogging: Blogging done with severe space or size
constraints typically by posting frequent brief messages about
personal activities.
Self Assessment Questions
1. What is informed consent with respect to Data Science?
2. What are the limitations and challenges faced?
3. Which are the three key principles? Describe them in brief.
Answers to Check your Progress
Fill in the blanks
1. Respect for individuals is a fundamental value for any concept
of research ethics
2. Data collection is often done by people who are interacting with
someone like a merchant, software vendor etc.
3. A retrospective study uses existing data that have been recorded
for reasons other than research.
True or False
1. False.
2. True.
3. False.
Multiple Choice Questions
1. b
2. a
3. c
References
1. https://journals.sagepub.com/doi/full/10.1177/174701611559956
8
2. https://www.theguardian.com/technology/2014/jul/29/okcupid-
experiment-human-beings-dating
Data Ownership
UNIT
4
Structure:
4.1 Data ownership
4.2 Its meaning and challenges
4.3 Importance of data ownership and ownership policies
4.4 Limits and issues in data ownership
4.5 Levels of ownership
4.6 Data hoarding and destruction
Summary
Keywords
Self Assessment Questions
Answers to Check your Progress
References
Objective:
After going through this unit, you will be able to
Understand in detail the concept of data ownership
Understand its importance and challenges
Understand various levels of ownership
Analyze other issues related to ownership of the data.
4.1 DATA OWNERSHIP
Ownership of data means possession and responsibility for
information. It also means control and power. Information control
includes not only the ability to access, create, modify, package,
retrieve, sell or remove data, but also the right to provide access to
others. Involved in controlling data access is the ability to share data
with colleagues to promote advances in research (a notable
exception to the free sharing of data is research involving
human subjects).
So who owns the data? The data is about you, so you may think it
belongs to you, but is it really yours? To emphasize the point, consider
this example: you write your friend’s biography. You own the
copyright on it because you have written it, and if your friend does not
like it, there is not much he or she can do unless you have written
false statements that harm their reputation; in that case you can be
sued for libel. If you photograph another friend, you own the photo,
but there are limits on what you can do with it. There may be private
places in which you cannot take the photo, and there are ways in
which you cannot use it; for instance, you cannot manipulate or
photoshop it unless you have permission to do so.
There are moments in everyone’s life when we do not look our best or
are not on our best behavior. If someone captured that moment and
made it public, it is something we might be very ashamed of, so there
is a real possibility of creating an environment in which our careers
can be hurt because someone photographed an unflattering moment
and used it to judge who we are. When thinking about who owns data,
the history of how society has treated things like books and
photographs can help. You can
record things about me, and if you have recorded things about me, you
can do whatever you want to with it. There may be some reasonable
limits on what kind of things you can record about me and what you
can do with it but your records are your records. And we have been
doing this for a long time. Remember in your teenage days, at some
point you would capture an embarrassing moment of your close
friends and post it on their birthday against their wishes? Apart from
the electronic world, we have long had letters of recommendation and
gossip. These things have human subjects, and these human
subjects may be affected by the contents of those letters of
recommendation or the gossip, but they really don't have much
recourse. To understand how this works in relation to intellectual
property and the way our society thinks about it, note that there are
three main types of intellectual property. The one most relevant here
is copyright.
Copyright attaches to a specific artistic expression. Say you take “Ek
Hota Carver” (a copyrighted book in the Marathi language) and
translate it into English. The translation is then a derived work. It is
not entirely original, because you did not write it from scratch, but it
is not the same as the original work either: you have applied your own
creativity in creating the translation. How derived works are treated is
one of the more complex parts of copyright. You can only create a
derived work with the permission of the owner of the original work,
but once you have that permission, the resulting work is yours.
Other types of intellectual property are less applicable to data. Patents,
for example, protect ideas for making or doing something innovative
and useful for the first time. Copyright, which applies to artistic
expression, is easier to relate to data, along with the idea of artistic
integrity: if you use a piece of information that does not belong to you
and that you have not produced yourself, you must credit the owner or
the source you got it from. The problem is that you don’t usually take
someone’s data and use it wholesale. What you do is take a small
piece of what you know, merge it with what others know, and create
something new; at the end of the day, all you can say is, “I’ve used
some of your data,” and that input from others somehow contributed
to the result. You cannot really say exactly what, or exactly how
much, each source contributed; the best you can say is that everything
people have told you has contributed to what you are saying now.
That is vague, and it makes it difficult for people to get the right
credit.
So far we have been talking about credit for things people want to
claim and own. Now let us look at the other side of the coin. Digital
data holds a great deal of cultural artwork and heritage, including
things that are hard to access in the physical world, and digitization is
often a great way to preserve our culture. Things that are not
preserved can be lost forever; think about ancient manuscripts and the
Vedas. There are important efforts to digitize such works and make
them available. Not only can digitized data preserve culture, it can
also propagate it. Consider how libraries work: the general idea is that
a library buys books and patrons can go to the library and borrow
them, so the library enables a book-sharing system. In a world of
abundant communication networks you can have a universal, virtual
library, with digital copies of books that can be lent to any library
user anywhere in the world. Through digitization, the library can
become a mechanism to spread culture everywhere in a way that is
impossible within the realm of traditional technology.
Data collection and organization take a lot of effort. Even when the
underlying items are free and freely available, a lot of effort goes into
cleaning, standardizing, and collecting the data and putting it in a form
that is easy for everyone to use. Once someone has made that effort,
they own the collection they have created. It is the result of their hard
work, and they can choose to make it public, or they can treat it as a
data asset and seek credit, whether in the form of publicity, money, or
whatever other reward they value. The same applies to data that you
have purchased as a starting point. Note that you do not actually need
to be an artistic creator to claim ownership: people may have given
you data freely, and the effort of putting it all together gives you
ownership of the collection. A well-known example of such a system
is Wikipedia. Wikipedia is crowdsourced: it is an encyclopedia that
people from all over the world have come together to create, and
individual contributors do not own it; the Wikipedia organization
owns the encyclopedia. They may have made a social decision, and
agreed with their contributors, to make it available to everyone for
free, but they do not have to. If you look at crowd-generated review
sites such as TripAdvisor or Rotten Tomatoes, their business models
are built on the opinions individuals express about the businesses and
works rated on the site. These ratings are contributed separately by
people who usually receive no compensation for them. The main
point is that, whether or not the contributors are compensated, the
collected data becomes the property of the compilers and aggregators.
The collection and its organization are theirs, the effort of creating it
matters, and they are free to sell ads or make money from the data in
any way they deem appropriate.
4.2 ITS MEANING AND CHALLENGES
Data scientists report that, on average, they spend 80% of their time
cleaning and compiling data, leaving only the remaining 20% for
actual analysis. As a data scientist, you spend time finding ways to
query multiple data sets, formatting data to work with different
analytics tools, and transforming the data you have collected into a
form you can actually use.
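The cleaning and reshaping work described above might look like the
following minimal sketch in Python with pandas. The file name,
column names, and transformations are hypothetical examples of the
kind of work involved, not a prescribed workflow.

import pandas as pd

# Hypothetical raw export with duplicate rows, messy text, and mixed date formats.
raw = pd.read_csv("orders_export.csv")   # assumed columns: order_date, region, amount

cleaned = (
    raw.drop_duplicates()
       .assign(
           order_date=lambda d: pd.to_datetime(d["order_date"], errors="coerce"),
           region=lambda d: d["region"].str.strip().str.lower(),
       )
       .dropna(subset=["order_date"])    # drop rows whose dates could not be parsed
)

# Reshape into the form an analytics tool expects: one row per region per month.
monthly = (
    cleaned.groupby(["region", cleaned["order_date"].dt.to_period("M")])["amount"]
           .sum()
           .reset_index(name="monthly_sales")
)
print(monthly.head())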
4.2.1 What does data ownership mean?
Here's what we mean when we talk to companies about owning their
data:
You are free to collect what you want.
You can compile the data in the format you want.
You can trace your data back by traversing through your
pipeline.
Your data can be used the way you want.
The data resides in your own internal servers and platforms.
End-to-end data ownership means knowingly controlling the status of
your data: You are responsible for what data you collect, how you
collect that data, where it goes after collection, and how it gets there.
When you have your data infrastructure, you collect data which is
important to you to learn about your products or customers.
Being able to see what happens as you move your data from archiving to your data warehouse means you can trust it to be accurate, and you can detect any quality issues before they interfere with your analysis. After compiling the data, you must have the freedom to perform any kind of analysis on, or otherwise exploit, the selected data.
4.2.2 The ownership challenge
The growth of the Internet has shifted attention from physical products to digital goods, and Internet-enabled companies increasingly create data products. Creating a data product requires expertise in both Backend Engineering and Data Science. With a few exceptions, most people who work in this field are deeply proficient in one domain and have only moderate knowledge of the other. Products that require experts from a variety of domains are a common challenge for companies today.
As a data scientist, your main job is to use data to help your
organization achieve its goals. How effectively you can use data
depends on its quality, and your efficiency suffers if you are at the
mercy of vendors and tools you use for collecting the data. Owning
your data eliminates wasting time cleaning it up so you can become a
data scientist instead of a data janitor.
Vague ownership of data products goes hand in hand with the ambiguous role of data scientists in cross-functional teams. When it is not clear what exactly your data represents, any analysis you do is based on questionable data. This can lead to difficulties in collaborating with other business functions; for example, marketing teams may spend money on suboptimal campaigns targeting the wrong people.
4.3 IMPORTANCE OF DATA OWNERSHIP AND
OWNERSHIP POLICIES
4.3.1 Importance of data ownership:
Ownership is a fundamental concept in our daily life and in the basic mechanisms of society. It describes the assignment of rights and obligations for a property to an individual or an organization. The general consensus in science emphasizes the principle of openness. Data sharing thus has many benefits for society in general and for protecting the integrity of scientific data in particular. The 1985 report of the National Statistics Committee on data sharing states that data sharing strengthens open scientific inquiry, encourages diversity of analyses and findings, and permits:
1. Reanalysis to verify or refute reported results
2. Alternative analyses to refine the results
3. Analyses to check whether the results are robust to different assumptions
The costs and benefits of exchanging data should be considered along ethical, institutional, legal, and professional dimensions. At the beginning of a project, researchers should clarify whether data can be shared and, if so, under which circumstances, for which purposes, and with whom.
4.3.2 Data Ownership Policies
Institutions that lack specific policies, supervision, and formal documentation may increase the risk of compromising data integrity. It is important to delineate the rights, commitments, expectations, and roles of all interested parties before beginning any research. Data integrity compromises can arise when investigators are unaware of existing data ownership policies and do not clearly describe rights and obligations related to data ownership. Some relationships between interested parties that warrant establishing data ownership policies are described below:
Between public / private industry and academic institutions: This refers to the sharing of potential benefits from research performed by academic staff under corporate sponsorship. Failure to clarify the question of data ownership early on in public / private relationships has led to controversy over the rights of institutions and industry sponsors.
Between research staff and academic institutions: Generally, funding is provided to research institutes and not to individual investigators. As the funded party, these institutions are responsible for overseeing a wide range of activities, including budgeting, regulatory compliance, and data management. To ensure that it can meet these responsibilities, the research institute claims ownership of the data collected using the funds provided to it. This means that researchers cannot automatically assume that they can keep their data with them if they move to another organization. The research institute that received the funding may hold the rights and responsibilities to control the data. Institutions are advised to clearly state their policy regarding data ownership and publish guidelines for such a policy.
Collaboration between research colleagues: This applies to joint
efforts that take place both within and between institutions.
Whether it is collaborations between colleagues, students, or
faculty members, all parties should have a clear understanding
of who controls how the data is distributed and (if applicable)
shared before it is collected.
Collaboration between authors and journals: To reduce the
likelihood of copyright infringement, some publishers require a
copyright assignment to the journal at the time of submitting a
manuscript. Authors should be aware of the implications of
such copyright assignments and clarify the guidelines
associated with them.
Investigators must learn to negotiate the delicate balance between their willingness to share data to promote scientific advancement and their duty to employers, sponsors, employees, and students to maintain and protect data. Non-disclosure agreements signed between investigators and their corporate sponsors can block efforts to publish data or share it with colleagues. In some cases, such as research involving human participants, data exchange may not be allowed for reasons of confidentiality.
Advances in technology have enabled researchers to discover new
avenues of research, increase productivity, and use data in a way that
has never been possible before. However, the careless application of
new technology can lead to a number of unexpected data ownership
problems that can compromise the integrity of research. While the
ideal is to encourage scientific openness, there are situations where it
may not be appropriate (especially with human participants) to share
data. The key is for researchers to understand various issues that affect
the ownership and sharing of their research data, and to make
decisions that advance scientific investigation and protect the interests
of the parties involved.
4.4 LIMITS AND ISSUES IN DATA OWNERSHIP
4.4.1 Limits on recording data
The fact is that the data belongs to whoever is recording the data. And
to compensate for this, there are limits on its use. There is a certain
expectation of privacy. If you're in a changing room of a clothing store or in a washroom at a mall or theatre, you don't want cameras there. When you own a phone, you don't want your telephone company to monitor all your phone conversations. Of course, there are exceptions. When you model for a magazine photoshoot, you give them ownership of your photos. So, on a contractual basis, you can arrange the sensible outcomes you want. But let us take a look at things that one does without having contracts. If you go into a store, there are probably video cameras there, and we know they are installed for security purposes. These cameras record all the time. Since the videotaped data is owned by the store, they may use it to improve their product placement for maximum sales. That is accepted by all. But suppose you are a famous person and make a quick visit to such a store, and later you find a video of yourself walking down one of the aisles going viral on the internet; you would consider that a breach of your privacy. While there is no contract between a customer and the store, there is a mutual understanding that the store will videotape its customers but not release the footage. If you have a
cell phone, you must have noticed this. The moment you cross your
area, you get a roaming network. This means the cell phone company
knows your location so that you can receive service. And when they
know your location around the clock, they know a lot about you. This
can lead to a huge loss of privacy. How much of your movement they are allowed to record, and what they may do with it, is something to ponder. There is a need to record the data, and with that need comes the potential for misuse. To avoid the misuse, we have limited the use instead of limiting the recording. There is no
contractual obligation for the store and the video camera. The store
believes, and rightly so, that, in terms of social consensus, if they
recorded and posted these videos from the cameras in the store,
customers would be very upset, and hence they do not do so. When it comes to cell phones and mobile applications on cell phones, we are still trying to reach a social consensus. In general, however, businesses know that customers get very unhappy when data collected for one purpose is used for another purpose that users did not expect. These agreements can be made in writing; they can be part of written contracts and then have legal force. And whether they are actually written down or not, these shared understandings remove the barriers to many transactions. Next comes government surveillance. If you
look at our security agencies, they really don't know what they're
going to need. And if they haven't gathered the data, they usually can’t
go back and look for it. And so we actually have this whole apparatus in which they compile a lot of data and plan never to look at it. If there is a specific need, they will seek authorization equivalent to an arrest warrant to actually view the data. If we can record information and assure that the recorded information will not be viewed by anyone, neither by individuals who lack the right to see it nor through security breaches, then we may actually have an intermediate point at which we differentiate between data collection and data usage.
4.4.2 Considerations/issues in data ownership:
Researchers should have a thorough understanding of various
problems related to data ownership in order to make better data
ownership decisions. In general, data ownership is a problem of control. Ownership implies a residual right of control: the right to determine what privileges others have over the data. While ownership rights can be easily assigned, it is difficult to attach responsibilities to roles, because data is inherently non-rival: the same data can be used for multiple purposes at once. This quirk has led to debates about data ownership and about how data is used and controlled. We classify the paradigms into three categories according to the socio-organizational context: individual, organizational, and shared ownership (everyone).
Data is increasingly claimed by individuals as the owners (subject as owner). With the Internet, personal data is being collected, used, and even sold in problematic ways. Hence, the individual ownership paradigm often emerges in response once the data collection is revealed. Such was the case in the Cambridge Analytica scandal, in which data from millions of Facebook users was used without their explicit consent. With the advent of the Internet of Things (IoT), the debate about ownership of individual data has taken on a new facet, as it remains unclear who owns personal data produced by machines.
In the context of organizations (company as owner), the data
ownership concept is becoming increasingly complex due to the
distributed data creation and processing in organizations. There
are three main grounds for claiming ownership:
1. Organizations claim ownership based on financing (funding organization as owner) or on having purchased or licensed the data. There are always two parties involved in these paradigms: on one side, the organization that finances the work and creates the data; on the other, the organization that purchases or licenses the data from that party.
2. An organization can claim ownership through its use of the data. This is usually the case with consuming parties (consumer as owner) who need a high level of trust in the data and therefore assume accountability for it. This can also apply to parties who read data from various sources (reader as owner) in order to build on it or add it to their knowledge base.
3. Companies create business value through data processing and therefore claim ownership. Depending on the type of processing, four paradigms can be distinguished:
a. creator as owner
b. packager as owner
c. compiler as owner
d. decoder as owner
The everyone-as-owner paradigm applies when data is intended to be shared with a broad user group. In this case, data ownership is not transferred to any individual or organizational party. Instead, everyone can own certain data with the same access rights. In particular, when the data is created in a crowdsourced manner (Wikipedia, for example), the community is the owner of the data and everyone has the same rights to access and use the data under certain restrictions. However, open data repositories require data management, which is often difficult to organize when responsibilities are distributed and cannot be assigned to an individual or organizational unit. While Open Access holds the potential for great innovation, problems arise with data protection, confidentiality, and control of data.
4.5 THE LEVELS OF OWNERSHIP
In practice, from a DS perspective, we have seen four levels of
ownership for data products.
Level 0: No ownership
The data scientist acts as a consultant for the algorithmic components
of the data product. Once they are completed, their ideas are
implemented and productionized by the owning team. This approach can only work if the data science part is simple enough for non-experts to understand, and if no further iteration on the model is required.
Level 1: Owning the DS prototype
The data scientist owns the original prototype of the data product's algorithm. They are free to use different data sources, test different models, and optimize the logic in any way according to their needs. Once the data scientists obtain satisfactory results, they hand their work over to an engineering team that re-implements the data scientist's product. The challenge with this approach is that the production system may be quite different from the data scientists' environment.
Programming languages (and their libraries) may differ, the way the data is accessed may differ, and performance requirements may differ, and the data scientist may not be made aware of these differences.
In short, the implementation of the data scientist's idea can deviate significantly from the original idea itself. Furthermore, this process involves long correction cycles, because faithfully reproducing the prototype is itself a challenge.
Level 2: Owning the data science part
The algorithm is solely owned by the data scientists. Their
implementation resides in containers in the production system. The
container carries out all the pure backend engineering work. Data
scientists can rely on properly defined input and output interfaces
without worrying too much about what happens outside of their
scientific responsibilities.
The challenges of this setup are the increased responsibility placed on data scientists to build performant algorithms, and the reliance on engineers whenever the interface needs to be changed. Moreover, important components of the data product, such as tracking or monitoring, still have unclear ownership.
Level 3: Owning the data product (Data Science & Backend
Engineering)
Data scientists have full ownership of the algorithms and complete
control over the engineering part of data production. In reality, this is
achieved by having a data scientist and a backend engineer, sharing
the same goals, in one team. In this setup, data scientists can optimize
the algorithm and engineers can optimize the surrounding backend in
any way they want.
The one constant across the different levels is the split between owning the algorithm and owning the backend framework. The challenge to be solved, therefore, is not to concentrate ownership in one person, but to minimize the number of different teams that own different parts of the data product.
This is the reason companies create teams that are able to own
products end to end. The same applies to data products. If your data
products are hard to create and iterate, you may have a problem with
your ownership structure.
4.6 DATA HOARDING AND DESTRUCTION
4.6.1 Data hoarding:
This practice is considered to be contrary to the general norms of
science which emphasize the principle of openness. Factors affecting
the decision to block access to data may include:
Ownership, financial, or security concerns
The expense and time required to document the data
The cost of providing all the materials needed to understand or extend the research
Technical barriers to sharing data in computer-readable form
Privacy
Concerns about the eligibility of data requestors
Withholding data for personal motives
Costs to borrowers
Costs to funders
4.6.2 Data destruction:
Companies are entitled to collect data in order to do business with us.
Companies don't want to upset customers. And so, sometimes they can
do things that annoy customers, but usually that's because they just
didn't realize they were going to annoy customers that much. And that
gives us some degree of assurance that they won't do terribly bad
things. Unfortunately, when a company goes bankrupt, its data is an asset, and its assets are sold like any other asset to third parties. As someone who contributed data while doing business with a company, you would want the data collected about you to be destroyed, not sold, if the company goes bankrupt. Bankruptcy law now actually provides partial protection: it basically says that, regardless of the sale, the company's privacy policy must survive even after the company goes bankrupt.
In summary, owning data is very complex. It is a lot harder to think about than owning intellectual property in other things, like works of art.
Check your Progress
Fill in the blanks
1. Ownership of data means ____________ and ______________
for information.
2. Creating a data product requires expertise in both ___________
______________ and ____ ________.
3. Not having specific policies, supervision, and formal
documentation may increase the risk of _______________
_____ ______________.
4. Thorough understanding of various problems related to data
ownership is needed to make _____ _________________
__________.
True or False
1. Every person who works in this field is deeply proficient in
both the domains required for creating a product.
2. How effectively you can use data depends on its quality.
Multiple Choice Question
1. Creative Commons has a set of standard copyright licenses that
are used widely. This course as a whole is released CC-BY-NC,
which means it can be reproduced with attribution (BY) for non-
commercial use (NC). Individual components are released CC-BY-
NC-ND, which means they can be reproduced with attribution(BY)
for non-commercial use (NC) without making any changes (ND = no
derivatives). Is it OK to reuse, with attribution, a single video from
this course in your own (non-commercial) presentation?
a. Yes
b. No
2. I agree to pose for some photographs you take with the promise
that you will keep these photos private. Some years later,
you change your mind and publish these photos.
Since you own these photos, are you within your
rights?
a. Yes
b. No
3. SuperStore has prominently displayed signs that read, “We
videotape you for your security.” Indeed, they do have multiple
cameras throughout the store. Later on, you learn that
SuperStore analyzes traffic flow in the videos to make
decisions on store layout and product placement. You feel the
signage is misleading, since the store is using the video not
just for security but also to boost profits. Are SuperStore’s
actions ethical?
a. Yes
b. No
Activity
Consider you use Alexa for various activities throughout the day.
Alexa must listen to every vocal interaction in your household so that
it can respond when it is called. Find out if these conversations and data are being stored somewhere. If yes, find out who owns the data: you, because the data is about you, or the company, because you are using their product?
Summary
Data Ownership
If someone records a piece of information about you, is it yours because the information recorded is about you, or is it the recorder's because they have recorded it? If you use a piece of information that does not belong to you, that you have not drafted yourself, you must credit the owner or the website from which you got the information. The problem is that you do not usually take all your data and use it on its own. What you do is take a small piece of what you know, merge it with what others know, and create something new. The same applies to data that you have purchased as a starting point.
What does data ownership mean?
As a data scientist, you spend time finding ways to query on multiple
data sets, formatting data to work with different analytics tools, and
converting it to the data you can use by applying as many changes as
you want to the data you have collected. You can compile the data in
the format you want. Your data can be used the way you want. End-
to-end data ownership means knowingly controlling the status of your
data: You are responsible for what data you collect, how you collect
that data, where it goes after collection, and how it gets there. After
compiling the data, you must have the freedom to make any kind of
analysis or exploit the selected data.
The ownership challenge:
The growth of the Internet has shifted attention from physical products to digital goods, and Internet-enabled companies increasingly create data products. As a data scientist, your main job is
to use data to help your organization achieve its goals. How
effectively you can use data depends on its quality, and your
efficiency suffers if you are at the mercy of vendors and tools you use
for collecting the data. Owning your data eliminates wasting time
cleaning it up so you can become a data scientist instead of a data
janitor. Vague ownership of data products comes with the ambiguous
role of data scientists in cross-functional teams.
Limits on recording data
The fact is that the data belongs to whoever is recording the data. On a
contractual basis, you can arrange the sensible outcomes you want. How much of your movement they are allowed to record, and what they may do with it, is something to ponder. There is a need to record the data. To avoid misuse, we have limited the use instead of limiting the recording. These agreements can be made in writing; they can be part of written contracts and then have legal force. And whether they are actually written down or not, these shared understandings remove the barriers to many transactions. If you look at our security agencies, they really do not know what they are going to need. And so we actually have this whole apparatus in which they compile a lot of data and plan never to look at it. If there is a specific need, they will seek authorization equivalent to an arrest warrant to actually view the data.
Considerations/issues in data ownership:
Researchers should have a thorough understanding of various
problems related to data ownership in order to make better data
ownership decisions. While ownership rights can be easily assigned, it
is difficult to associate responsibilities with roles because the data is
inherently non-rival. This quirk has led to debates about data
ownership and how it is used and controlled. We classify the
paradigms into three categories according to the socio-organizational
context: individual, organizational and shared ownership (everyone).
The levels of ownership
Level 0: No ownership
The data scientist acts as a consultant for the algorithmic
components of the data product.
Level 1: Owning the DS prototype
The data scientist owns the original algorithm of data
production.
Level 2: Owning the data science part
The algorithm is solely owned by the data scientists. The
container carries out all the pure backend engineering work. Data
scientists can rely on properly defined input and output interfaces.
Level 3: Owning the data product (Data Science & Backend Engineering)
Data scientists have full ownership of the algorithms and
complete control over the engineering part of data production.
Data hoarding:
This practice is considered to be contrary to the general norms of
science which emphasize the principle of openness.
Data destruction:
Companies are entitled to collect data in order to do business with us.
Companies do not want to upset customers. And so, sometimes they
can do things that annoy customers, but usually that's because they
just did not realize they were going to annoy customers that much.
And that gives us some degree of assurance that they will not do
terribly bad things. Unfortunately, when a company goes bankrupt, its data is an asset, and its assets are sold like any other asset to third parties. As someone who contributed data while doing business with a company, you would want the data collected about you to be destroyed, not sold, if the company goes bankrupt.
Keywords
Ownership: The act, state, or right of possessing something.
Data hoarding: Digital hoarding is excessive acquisition and
reluctance to delete electronic material no longer valuable to the
user.
Data destruction: The process of destroying data stored on tapes, hard disks, and other forms of electronic media so that it is completely unreadable and cannot be accessed or used for unauthorized purposes.
Self Assessment Questions
1. What is data ownership?
2. What are the challenges in Data Ownership?
3. Describe the levels of Data Ownership.
4. Write a short note on Data Hoarding and Data Destruction.
Answers to Check your Progress
Fill in the blanks
1. Ownership of data means possession and responsibility for
information.
2. Creating a data product requires expertise in both Backend
Engineering and Data Science.
3. Not having specific policies, supervision, and formal
documentation may increase the risk of compromising data
integrity.
4. Thorough understanding of various problems related to data
ownership is needed to make data ownership decisions.
True or False
1. False
2. True
Multiple Choice Question
1.a
2.b
3.a
References
1. https://snowplowanalytics.com/blog/2019/02/05/how-data-ownership-makes-you-a-more-effective-data-scientist/
Privacy, Anonymity And Data Validity
UNIT
5
Structure:
5.1 Privacy
5.2 Anonymity
5.3 Data Validity
Summary
Keywords
Self Assessment Questions
Answers to Check your Progress
References
Objective:
After going through this unit, you will be able to
Understand privacy concerns and their challenges
Acquire knowledge about benefits, popular techniques and
common threats of data anonymity
Understand data validation
5.1 PRIVACY
Data privacy, security and administration are the main concerns today
due to increasing government regulation. It is often said that technology outpaces the rest of the world, but when it comes to privacy, regulation quickly takes hold. It is difficult to balance
transparency of the data and protection of the data. Without data
collection and the free flow of information, data science would not
exist. However, the more information you gather, the more
complicated it becomes to protect it.
How do we keep user privacy and still build cool products based on
machine learning? The first step in increasing user privacy while
preserving data usage is to go beyond the simple opt-in and opt-out
model. Privacy can be a gray area rather than simply black or white. There have to be options between collecting all of the data and collecting none of it. Some of these options let the user control the data they provide, and some let the developer or data scientist control the amount of data stored and used.
There was little privacy in small towns, as everyone in town always
poked their nose into everyone else's business. And as everyone knew
what you were up to, you knew what everyone was up to. Big cities
offer anonymity. Nobody cares, nobody knows you, you can do what
you want. And hence people believe that information technology, which connects the whole world, is like a small town that offers very little privacy. Data science can also create large information asymmetries: some parties end up knowing far more than others.
Since big data is universal, the problem associated with it is that it remembers everything forever. There is, for example, the Wayback Machine, which archives pages on the web. This archive contains almost everything that is accessible on the web, except for things that are password-protected, and the intent is to keep this information forever. A web page will survive in the archive even if the page itself has since been removed.
There are degrees of privacy. Privacy boundaries differ from person to person, but different boundaries do not mean no boundaries. For example, consider the browsers that you use. In incognito mode, you do not get the same services that you would get with normal browsing. This is a compromise that browser manufacturers have offered us, and people use it as they see fit. Another thing to keep in mind about privacy is that sometimes the decisions are not necessarily made by you. Take the example of an Instagram feature. You went to a party and posted a story visible only to close friends because you did not want everyone to know that you attended. Your friend posts it as a normal story. Since you both have many mutual followers, the people you were hiding the story from see it anyway. This was a fairly harmless situation, but there are other, more serious settings where the same thing can happen.
Just because data was collected does not mean that you will be hurt by it. For example, consider the CBI or other investigative agencies. If they want to dig up information about a potential criminal, they need to have information in their database. So your data is collected, with the promise that it will not be looked at until it is needed. This is also standard practice with surveillance cameras. In summary, the loss of privacy occurs when control over personal data is lost. There are three main potential reasons for such a loss:
First is surveillance. It could be government agencies doing this for
national security, or it could be private investigators and private
companies, having surveillance cameras and logs of various kinds.
Second is advertising. You realise that your data is collected and read
when an ad follows you everywhere. Using your data and showing
similar products is useful when done on the same online store. But if
the product follows you outside the online store, something is wrong.
And then the third category is finding out about a person, whether that person is a potential employee, a potential borrower, or a potential appointee. To find out what kind of person they are, you examine their electronic trail. This might be done by companies or by individuals. The holders of this information can now collect huge amounts of data, and such data has been used to infringe, or potentially damage, the privacy of the people it describes. Most people are not aware of what can be learned about them by combining multiple sources. Data brokers combine information from multiple sources to create more complete information products. They will piece together little bits of information about you from various sources, put it all together, and end up with a valuable profile that they can sell to people who may want to hire you or give you a loan.
Waste data collection
For example: you go to a club and show them your driver’s license as
a proof of age. Now imagine that this is a high end club, where the
bouncers scan the ID instead of just looking at it and verifying your
age. Would you object? Or try to convince the bouncer to just look at
it and not scan it? Now if they scan, they have a computer on which
they get this information and choose to store it. So now they have your
name, date of birth and other information on your driver's license that
they can use, for example, for marketing.
Data that stores information about other data is metadata. It often has less privacy protection than the data itself and is distinct from the data's content. For example, the metadata of a phone call may contain information about the caller and the receiver, the time and date of the call, its duration, the location, and so on. Knowing the location does not reveal the contents of the call, and the user has to disclose his location to the cell phone company to receive service. However, tracking a location over time can reveal a large amount of information about a person.
The power of analysis
For example, your home's smart water meter is constantly recording
water usage. It can identify the signatures of different water uses. And when this
smart water meter communicates with your utility, your utility knows
every time you actually flush the toilet in your home. This is
something you might consider an invasion of privacy, but it is all part
of water conservation and a smarter water use system in the
community. To take this to another level, there have actually been
attacks on encrypted data based on observing the power consumption
of chips performing encryption.
The data exchange is now contractual and we need to ensure that these
contracts are developed in a way that privacy is properly respected and
managed. It is difficult to determine how best to protect data and
maintain privacy. As a data scientist, when you work with potentially
sensitive data, you need to determine what data you can use and how
best to protect it.
Your first choice is about your data. It is best not to use sensitive data
at first. Let's start there: if you didn't have the data, would you actually
need it enough to do tedious gathering? Oftentimes it might be nice to
have extra data or variables, but you don't really need them.
Suppose you realize that you really need the private data. You might want to derive some features from attributes like geographic location, gender, or level of education. If you need to keep sensitive data and use real values, consider whether K-anonymity can help protect privacy. K-anonymity protects against identification attacks by creating buckets in which any single person is hidden within a group of at least K people sharing the same generalized attributes. This means that you are essentially coarsening the data. If you do not need to keep the sensitive variables, you can use pseudonymization or synthetic data to keep plausible values in your data set without revealing the true values. Suppose you are still unsure whether or not you need to maintain real values. If the work is for research or exploration purposes only, use pseudonymized or synthetic data first. If you are past the research stage and still need to use private data, the next step is to examine how to securely share the resulting analysis or model.
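To make the K-anonymity idea above concrete, here is a minimal, illustrative sketch in Python. The column names, the generalization rules, and the choice of k are assumptions made for this example only, not a prescribed procedure.

import pandas as pd

def generalize(df):
    # Coarsen the quasi-identifiers: exact age becomes a decade band,
    # and only the first three digits of the postal code are kept.
    out = df.copy()
    out["age"] = (out["age"] // 10 * 10).astype(str) + "s"
    out["zip"] = out["zip"].str[:3] + "XXX"
    return out

def is_k_anonymous(df, quasi_ids, k):
    # Every combination of quasi-identifier values must cover at least k records.
    return df.groupby(list(quasi_ids)).size().min() >= k

records = pd.DataFrame({
    "age": [34, 37, 36, 52, 58, 55],
    "zip": ["411032", "411045", "411038", "411001", "411014", "411007"],
    "diagnosis": ["flu", "asthma", "flu", "flu", "diabetes", "asthma"],
})

anonymized = generalize(records)
print(is_k_anonymous(anonymized, ["age", "zip"], k=3))   # True for this toy data

With real data, one keeps coarsening (or suppressing) the quasi-identifiers until the check passes for the chosen k.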
When you publish a machine learning model built on sensitive data, you should be aware of privacy-preserving machine learning. Differentially private aggregation results, for example, can still contain enough information to guide product changes.
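As a rough illustration of differentially private aggregation, the sketch below adds Laplace noise to a simple count; the data, the epsilon value, and the query are all invented for the example.

import numpy as np

rng = np.random.default_rng(0)

def dp_count(n_true, epsilon):
    # Laplace mechanism: for a counting query the sensitivity is 1,
    # so the noise scale is 1 / epsilon.
    return n_true + rng.laplace(loc=0.0, scale=1.0 / epsilon)

true_count = 1042            # e.g. users who clicked a new feature
print(true_count, round(dp_count(true_count, epsilon=0.5), 1))

Smaller values of epsilon add more noise and give stronger privacy, at the cost of less precise aggregates.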
Finally, let's say you are working on a model or data analysis intended for internal use only, which has become popular and which the company now wants to publish or share with others. Your best strategy is to apply the same privacy-preserving methods to the model before releasing it.
The Challenges
Data scientists working with user data face several challenges:
Make data both protected and accessible (when lawful
disclosure is required)
Creating ways to share and process data that not only protect
privacy, but also allow information to be withdrawn if
necessary
Learning to work with limited data, the use of which is
restricted or regulated by law
Maintaining sufficient flexibility and interpretability to ensure
sufficient transparency of the processes (and also to make the
technology future-proof)
For projects intended for several countries: Compliance with
different regional laws regarding data protection and security
One of the direct requirements of many regional regulations is to de-identify the data collected, separating the information from the real people to whom it applies and “desensitizing” it for third-party access.
Data Anonymization
One way to achieve this is to anonymize the data. The disadvantages of anonymization, from a business perspective, are that it is irreversible by design and that fully anonymized data has limited strategic, analytical, or aggregative use. One of the simpler versions is suppression, in which identifiable pieces of data are changed to fixed, predetermined values.
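A minimal sketch of suppression, with invented field names, might look like this:

record = {"name": "A. Sharma", "phone": "98220 12345", "city": "Pune", "age": 34}
SUPPRESSED = {"name", "phone"}          # fields treated as direct identifiers

def suppress(rec):
    # Replace identifiable fields with a fixed, predetermined value.
    return {k: ("REDACTED" if k in SUPPRESSED else v) for k, v in rec.items()}

print(suppress(record))   # {'name': 'REDACTED', 'phone': 'REDACTED', 'city': 'Pune', 'age': 34}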
Data Pseudonymization
Pseudonymization and tokenization are an alternative approach that aims to retain enough true-valued data while using pseudonyms (tokens) in place of the sensitive and identifying parts that associate records with their originators. This approach makes the changes reversible with the help of additional information (stored separately and given extra protection) and helps ensure compliance with data protection regulations.
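A minimal pseudonymization sketch is shown below; the token format and the idea of a separately protected lookup table are assumptions for illustration, not a prescribed design.

import secrets

lookup = {}   # token -> original value; must be stored and protected separately

def pseudonymize(value):
    # Replace a direct identifier with a random token and remember the mapping.
    token = secrets.token_hex(8)
    lookup[token] = value
    return token

def reidentify(token):
    # Only a party holding the lookup table can reverse the substitution.
    return lookup[token]

token = pseudonymize("customer@example.com")
print(token)                  # goes into the analytics data set
print(reidentify(token))      # recoverable by an authorized party only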
Data Generalization
As the name suggests, data generalization is a method of de-
identifying data by changing specific values into broader categories.
Grouping people of a certain age into general age groups, locations
into larger areas, and so on, and making sure that the data cannot be
converted to its original state is a popular method that is widely used
in traditional data analysis.
Data Encryption
One way to provide data security and private data exchange is to
deploy data encryption while implementing neural networks and models. Symmetric encryption, asymmetric encryption, or the
employment of a combination of the two ensures that the initial state
of the data can only be obtained by an authorized agent. Using
encryption and creating models that work with encrypted information
can then provide an opportunity for agents to exchange data more
securely and create more ‘models as a service’ without compromising
the privacy of sensitive data.
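As an illustrative sketch, symmetric encryption of a sensitive record could look like the following; it uses the third-party cryptography package, and the payload shown is invented.

from cryptography.fernet import Fernet   # pip install cryptography

key = Fernet.generate_key()              # the key itself must be stored securely
cipher = Fernet(key)

plaintext = b"patient_id=123;diagnosis=asthma"
token = cipher.encrypt(plaintext)        # safe to store or transmit
print(cipher.decrypt(token))             # only a key holder recovers the original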
Synthetic Data Generation
Used both as a method of effective machine learning with small data
sets and as a way of data de-identification, synthetic data generation is
a promising concept that can be used effectively to ensure compliance
with privacy laws. Various methods are currently being actively
developed as multipurpose technologies to create and use reliable data
that is devoid of sensitive, personally identifiable, or difficult-to-
obtain details in various scenarios.
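A very simple synthetic-data sketch is shown below: it fits independent per-column models to a toy data set and samples new, artificial records. Real synthetic-data tools also try to preserve correlations between columns; the column names and values here are invented.

import numpy as np

rng = np.random.default_rng(42)

real_ages = np.array([23, 31, 35, 40, 44, 52, 58, 61])
real_cities = ["Pune", "Pune", "Mumbai", "Delhi", "Pune", "Mumbai", "Delhi", "Pune"]

def synthesize(n):
    # Sample ages from a normal fit and cities from their empirical frequencies.
    ages = rng.normal(real_ages.mean(), real_ages.std(), size=n).round().astype(int)
    cities, counts = np.unique(real_cities, return_counts=True)
    city_sample = rng.choice(cities, size=n, p=counts / counts.sum())
    return list(zip(ages, city_sample))

print(synthesize(5))   # artificial records that match the originals only in aggregate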
The future of data privacy
On the one side, “ethical AI” models could make data collection more
compliant, and AI algorithms are undoubtedly fighting at the forefront
of cybersecurity. On the flip side, AI-powered cyber attacks are
probably the biggest global threat to data security. And there are more
insidious threats: Bad data and human biases can reinforce all sorts of
nasty AI-driven biases, the exact opposite of what data science is trying to achieve.
Increasingly, numerous rules have been enforced to protect our data
from misuse. The good news is that there are various data privacy
techniques available for adoption. For example, federated learning.
The need for privacy and security is paramount in the industry and
privacy enhancing technology should be available everywhere.
Conclusion
Protection against leaks and violations can be viewed as protection
against general abuse, including fraud, theft, bias and many other
aspects that the concept essentially covers. By creating regulatory-compliant, transparent, fair, and secure practices for handling sensitive data, we help ensure that data science ushers in an age of efficient, personalized, automated solutions.
5.2 ANONYMITY
Most of the data we collect today can easily be linked to an individual,
household or institution. However, using data without regard to
protecting the identity of the data owner can lead to numerous
problems and potential litigation. In simple terms, data anonymization
ensures that we cannot identify the actual data owner based on the
data.
“Anonymous” Data Won’t Protect Your Identity
A new study shows that it is surprisingly easy to identify a person
within a supposedly incognito data set. So let's think a little about
when anonymous online transactions are possible. There are things
like Bitcoin that can be used to pay anonymously. Many of the
transactions that you may want to carry out, even while doing them on
the internet, require ID. When booking a plane ticket or a hotel room,
you must provide your name and other identification. When you get
cellular service, you'll need to provide your location. If you want
effective medical care, you need to disclose confidential details of
your health and lifestyle. And so there are some things you possibly
can do without saying who you are, but for most things you need to
say something about you and at least reveal some aspects of yourself.
De-identification is the removal, from data, of information that could be used for identification. The general interpretation is: if things like name, phone, and address are present in a data set, remove them. Then there are no personally identifying attribute values in the data set, and the identity of the person is not immediately apparent. The point is, we need to realize that once we have done de-identification, all it means is that the person's identity is not immediately obvious; it could still possibly be determined.
How de-identification works.
Oftentimes, data is structured as a network or graph. Although you have removed the labels that tell you which node in the network is which, if you can match the structure of one graph with the structure of another graph, you can find things out. Other ways in which de-
identifications. The simplest method of anonymizing data is to simply
delete personal data or replace it with a pseudonym or an artificial
identifier. This can be reversible if there is a look-up table to convert
the pseudonyms back to the original data, or irreversible if the data is
completely removed.
If you cannot prevent re-identification, four different types of leakage can occur. What we have focused on so far is identity disclosure. But there are other things that can be revealed which fall short of revealing full identity and could nonetheless be extremely harmful. An attacker can discover the value of a hidden attribute, or may be able to uncover the connection between two entities. If you have the metadata for phone calls, it could be enough to figure out friendship circles through the network of calls, which tells you which people are friends and which entities are related. For example, if you track a person's cell phone location and see that they are at a place of worship every Sunday morning, you can determine, first, that they are religious and, second, which religious denomination they belong to. Regarding data
that you consider private and that actually becomes public, it's not just
algorithms or attackers, your friends and relatives can play a role as
well. You go to a party. You don't tell people you went to that party,
but a friend posts photos from a party. And you're there in some of
these photos. And so anyone who sees these photos will know that
you were at the party. And you have no control over what photos your
friends have posted.
There is enough other data in the world to defeat de-identification. We
should design other parts of our system to work, provided that
someone who really wants to can breach anonymity. Access to data is
vital for many desirable purposes. And when fears of disclosing
sensitive personal information prevent it, we lose the benefits in terms
of medical advancement or in terms of watchdogs overseeing how our
government invests resources. If you have de-identified data, it means
that you have put a lock on the door of your house. It means that a
passerby cannot just enter. That doesn't mean you have a door that
couldn't be broken, or a lock that couldn't be opened, or a window that
couldn't be broken to force entry. It is certainly not the case that my
home is guaranteed to be invulnerable. But locking the door when you
leave the house is still valuable. The same is true of de-identification of data. Once you have de-identified the data, casual identification will not be possible. Someone who works hard enough at it can still re-identify the data, and our aim is simply to make that hard.
Another example: Credit card statements are very personal. If you
decide to examine a person's statement minutely, you can pretty much figure out their habits, beliefs, and other intimate details that you might not know even if you were actually friends with this person. These are personal things that other people may not want to share with you. So when you show someone your credit card statement, you give them a look at a much bigger picture of your life. How does a company deal with this conflict? In practice, a statement submitted with an expense claim is only seen by a handful of people who are responsible for processing it, and possibly by some downstream accounting or auditing groups, but only by people who have a business need. Their number is small, and that gives us the kind of privacy assurance that allows us to share something as personal as a credit card statement.
The anonymization is intended to guarantee the protection of the data.
This data category includes:
Personal data
Business information such as financial information or trade
secrets
Classified information such as military secrets or government
information.
The General Data Protection Regulation (GDPR) defines anonymized
information as:
“Information that does not relate to an identified or identifiable person
or to personal data that has been anonymized so that the person
concerned cannot or can no longer be identified.”
The “identifiable” and “no longer” parts are essential. Not only does this mean that your name should no longer appear in the data; it also means that no one can figure out who the person is from the rest of the
data. This refers to the re-identification process (sometimes de-
anonymization).
This opens up a few possibilities:
Selling data is an obvious first use. Worldwide, data protection
regulations restrict the trade in personal data. Anonymized data
offers an alternative for companies.
It's an opportunity to work together. Data is shared by many
companies for innovation or research purposes. Risks are
limited by using anonymized data.
Also it creates opportunities for data analysis and machine
learning. Anonymized data is a safe raw material for statistical
analysis and model training.
Pseudonymization and de-identification are indeed ways of protecting certain aspects of privacy. Pseudonymization techniques remove or replace the direct personal identifiers in the data. However, indirect identifiers are often retained in the rest of the data, and this is information that can be combined to recreate direct identifiers. De-identification techniques remove both direct and indirect personal identifiers from the data.
The key benefits of data anonymization
Anonymous data does not require any additional safeguards to ensure
its security. This means, among other things:
Consent is not required to process it
It can be used for purposes other than its original one
It can be stored indefinitely
It can be exported internationally
The most popular anonymization techniques
There are a number of different techniques and methods used to
permanently mask the original content of the data set.
Randomization:
Noise addition: Here, for example, personal identifiers are expressed
imprecisely. For example:
height: 190 cm → height: 330 cm
Substitution / Permutation: When personal identifiers are mixed in a
table or replaced by random values, for example:
ZIP: 411032 → ZIP: postcode
Differential data protection (differential privacy): carefully calibrated noise is added to the data or to query results, so that an acceptable, quantifiable amount of information leakage is defined in advance. (A small sketch of these randomization techniques follows below.)
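A minimal, illustrative sketch of noise addition and permutation; the values are invented for the example.

import numpy as np

rng = np.random.default_rng(7)

heights = np.array([190.0, 162.0, 175.0, 181.0])                 # cm
zips = np.array(["411032", "411045", "411001", "411014"])

noisy_heights = heights + rng.normal(0, 10, size=heights.size)   # noise addition
permuted_zips = rng.permutation(zips)   # permutation: zips no longer line up with their rows

print(noisy_heights.round(1))
print(permuted_zips)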
Generalization:
K-Anonymity: where personal identifiers are generalized into an area
or group, for example:
Age: 15 → Age: 10-25
L-Diversity: An extension of K-anonymity in which the personal identifiers are generalized first and each sensitive attribute within an equivalence class is then required to take at least a minimum number of distinct values, so that the group does not reveal a single sensitive value for everyone in it.
The most common threats:
However, each of the techniques described above has its own set of
dangers, especially when tested against the three most common risks
in anonymizing data, which are:
1. Singling out: The ability to isolate some or all of the records that identify a person in the data set.
2. Linkability: The ability to link at least two data sets on the same
data subject or group of data subjects (either in the same
database or in two different databases).
3. Inference: The ability to derive the value of an attribute from
the values of a number of other attributes with a significant
probability.
For these reasons, it is advisable to use not just one but a combination of several anonymization techniques at the same time, in order to prevent your data set from being re-identified. Even this approach, however, does not necessarily lead to complete data security. With so many cross-referenced public records now available, any record with a reasonable amount of information about someone's actions has a good chance of being matched to identifiable public records.
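The linkability risk above can be illustrated with a small, entirely invented example: an "anonymized" table is joined to a public register on shared quasi-identifiers, re-attaching names to sensitive values.

import pandas as pd

anonymized = pd.DataFrame({
    "zip": ["411032", "411045"], "birth_year": [1987, 1990], "gender": ["F", "M"],
    "diagnosis": ["asthma", "diabetes"],
})
public_register = pd.DataFrame({
    "name": ["Asha", "Rahul"],
    "zip": ["411032", "411045"], "birth_year": [1987, 1990], "gender": ["F", "M"],
})

# Joining on the quasi-identifiers re-identifies the records.
reidentified = public_register.merge(anonymized, on=["zip", "birth_year", "gender"])
print(reidentified[["name", "diagnosis"]])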
Disadvantages:
While data anonymization has some very powerful advantages, don't
forget about its disadvantages.
It is important to note that if you want to anonymize new data
collected from your website, you either need to obtain consent to
collect personal information and then use anonymization techniques,
or just need to collect anonymous information from the start. As safe
as this approach may sound, it also robs you of any valuable insight
that you can gain with more detailed information about your
customers. If you remove all common identifiers from your data, you
won't be able to offer your customers and visitors a more personalized
approach by, for example, providing them with customized messages
and special offers or recommendations.
Final thoughts
Anonymization is definitely one of the best ways to keep the collected
information safe. With this additional security measure, your data can
be used freely in a way that is not permitted by law when it comes to
non-anonymized data. However, using personal data in its pure
(original) form also offers some significant advantages. This is why
we really need to think through the pros and cons of each option
before making a final decision. Regardless of the method chosen, it is
important to remember that storing data in a secure environment is
also of paramount importance.
5.3 DATA VALIDITY
Validity is a measure of whether or not the data we have available actually support the conclusion we draw as being true. It is important to remember that validity relates to the causal mechanisms between an observed “A” and an observed “B” in the sample or population studied.
So take an example of Twitter, a popular source for analyzing public
opinion. But we know that Twitter users are not representative of the
whole population. Furthermore, even if we know what the
characteristics are of Twitter users, tweets aren’t representative of the
opinions of all Twitter users. Most Twitter users don't tweet at all,
they just read tweets. So what we really hear are the opinions of the
opinion makers. And that may or may not be a good representation of the population as a whole. A similar example is company feedback: companies listen to customer responses on their survey forms. It is reasonable to do so, but a company can handle a complaint regardless of whether that opinion is representative of the population.
This distinction between the data we have and the data we want is important to keep track of. When we do not have exactly the data we want, statistical techniques exist to help us reweight the samples so that they are balanced on at least the important attributes. So, if we feel that opinions may differ based on race, gender, or age, we can try to ensure that the sample we have collected is balanced in those terms.
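A minimal post-stratification sketch illustrates this kind of reweighting; the groups, population shares, and responses are all invented for the example.

import pandas as pd

responses = pd.DataFrame({
    "age_group": ["18-29", "18-29", "18-29", "30-59", "60+"],
    "opinion":   [1, 1, 0, 1, 0],          # 1 = favourable
})

population_share = {"18-29": 0.25, "30-59": 0.50, "60+": 0.25}   # assumed census shares

sample_share = responses["age_group"].value_counts(normalize=True)
responses["weight"] = responses["age_group"].map(
    lambda g: population_share[g] / sample_share[g]
)

raw = responses["opinion"].mean()
weighted = (responses["opinion"] * responses["weight"]).sum() / responses["weight"].sum()
print(f"raw estimate: {raw:.2f}, reweighted estimate: {weighted:.2f}")

Groups that are over-represented in the sample get weights below one, and under-represented groups get weights above one.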
Training data is also a problem if you want to deal with systems that
work with the future population when the future population is
different from the previous population. If the future does not resemble
the past, which happens from time to time, we need to pay attention to
these singularities. A much more difficult thing is the gradual drift. As
society changes over time, the nature of our population can change.
And things that were trained some time ago will no longer work
because we would have moved away from what they were trained on.
It is an open question how to deal with this type of drift in terms of
retraining. We need to choose which attributes to use and how to
measure them. And these are important things when we set up our
data analytics. First, let's look at what attributes we choose. This is
something where our choices are usually limited by what is available.
Another source of error is mistakes in data processing, or errors in the data itself. Modern data analytics can often involve a lot of fancy data processing. For example, if we try to extract sentiment from text
based on a long paragraph that may include sarcasm. Finding out such
things is not easy. Detecting faces in photos, or taking two databases and finding records for the same person when the fields recorded differ, and confirming that they really are the same person and not different people: all such tasks are difficult to perform well. There are a lot of great technologies out there for
making them as good as possible. So we should expect errors to occur
due to errors in data processing. In addition, we also have human and
subjective errors. They are simple data entry errors, typographical
errors. People confused about whether five was best or worst on a
scale of one to five. Things that required entering data were encoded
and the wrong code was reported. Things where a person
misunderstood the meaning of a data field because the definition was
complicated or the name was misleading. For many of these reasons,
errors can occur. And these mistakes can lead to poor results.
So let's take an example of errors in the data: credit reports. This is how the system works: companies report to credit bureaus, on a monthly basis, whether your payments are up to date. These agencies create credit reports and calculate a credit score. There are often errors in credit reports, but there is also a process to correct them. One study found that such mistakes resulted in a credit score that was 20 points or more lower. That is a big enough difference that it could result in someone being turned down for a loan or being offered a higher interest rate than they really deserved. And this is true of a system that has been in place for some time, is well regulated, and has a well thought out process, even if it is not perfect.
Take this now and enlarge it in the context of what we are doing
today. We make material decisions based on public data or data
provided by third parties. We know that this data is often incorrect.
And the question is, does the affected subject even have a mechanism
to correct errors? We want data sources that are authoritative,
complete and timely to use, data sources that provide you with good
data and provide access to the subject of that data.
You could come to invalid conclusions from your analysis because of
errors in model design, even if you had perfect inputs and perfect data.
There are many ways the model could be wrong. So let's look at all
five things one by one.
1. Questionable model structure. It is important to note that most machine learning only estimates parameters to fit a predetermined model. Very often we choose a very simple model, not because we know the simple model will work, but simply because it is easy. So you may have a complex nonlinear process happening in the world while you have decided to build a simple linear model. You can fit the data to your linear model and learn the best linear fit, but it will not be perfect, and you will have to decide whether or not it represents the process correctly. (A small illustration of this appears after this list.)
2. Extrapolation. This can be very dangerous unless we have reason to
   believe that our model has been chosen correctly. There is a related
   problem with aggregation. Often we have aggregated data and therefore
   analyze results for a group, and based on this analysis of group data
   we attribute the results to individuals. For example, suppose we have
   district-level data showing that higher-income districts have lower
   crime rates. From here we might be tempted to conclude that richer
   people are less likely to commit crimes. In fact, this conclusion does
   not follow from the aggregated data: it is one possible explanation for
   the overall result, but there are other explanations involving how
   people, incomes and crime rates are distributed across districts.
3. Change is another source of error. Good management of change is
   critical to maintaining the validity of our results. We live in a
   complex world, we often analyze complex systems, and our analyses
   involve many steps. Over time, the system changes, and the question we
   have to answer is whether the analysis is still valid. Most changes may
   not affect the analysis, but some do. Organizations work very hard on
   the metrics by which they measure employee performance, because they
   know that employees will optimize for those metrics at the expense of
   things the metrics do not count, and those other things can be very
   important. It is not just the choice of metric we need to worry about,
   but also the gaming of the metric: if critical inputs can be
   manipulated, they will be manipulated. Finally, it is important that we
   carefully check the validity of our data and our model. There are many
   ways we can corrupt the data, the model, or both, and when we do, we
   get bad results. Since the results of our analysis have an impact on
   society, these mistakes can cause real harm.
4. Whenever you collect data, you need to think about the population you
   are collecting it from, and whether the population to which you want to
   apply the results of the analysis matches the population on which you
   collected the data. One place where this is certainly true is in
   medicine. The point is, you need to think about the population you are
   studying and make sure you are collecting data on the correct
   population. If you want to divide your study into population groups,
   you need to think carefully about how you do that segmentation.
5. Most of the people who write algorithms are not racist and do not want
   to write algorithms that have racist ramifications or make racist
   decisions. However, the way the algorithms actually work can have a
   significant racial impact. One study found that people with
   Black-sounding names were shown more ads related to lawyers, arrests,
   and other criminal matters than people with white-sounding names. When
   search engines choose which ads to show, they base the choice on which
   ads are likely to be of interest and which users are likely to respond;
   the decision is a purely statistical, frequency-based decision made by
   the algorithm without thinking about race. Efforts are being made to
   reduce such effects.
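As a minimal sketch of the model-structure and extrapolation problems above, the example below fits a straight line to data generated by a hypothetical nonlinear process. The fit looks reasonable inside the observed range but becomes badly wrong when extrapolated; the specific numbers are illustrative only.

import numpy as np

# Hypothetical nonlinear process: y grows quadratically with x, plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 + 0.5 * x**2 + rng.normal(0, 2, size=x.size)

# Fit a simple linear model anyway, because it is easy.
slope, intercept = np.polyfit(x, y, deg=1)

def linear_prediction(x_new):
    return intercept + slope * x_new

# Inside the observed range the linear fit is tolerable...
print("prediction at x=5 :", linear_prediction(5), " true mean:", 2 + 0.5 * 25)
# ...but extrapolating far outside it gives a badly wrong answer.
print("prediction at x=30:", linear_prediction(30), " true mean:", 2 + 0.5 * 900)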
Check your Progress
Fill in the blanks
1. Without _________________ and
_________________________, data science would not exist.
2. _______________ protects identification attacks by creating
buckets that contain a single person in a group of K people.
3. It is advisable to use a combination of
_____________________________ at the same time in order to
prevent your data set from being identified again.
4. Most machine learning only estimates parameters to fit a
________________ model.
True or False
1. Fully anonymized data has limited strategic analytical or
aggregative use.
2. De-identification cannot be defeated through a combination of
several partial identifications.
3. If teenagers knew that their parents could have access to all
their social media posts, then the teenagers would likely be very
careful about what they post.
4. Undesired analysis of previously collected personal data
violates privacy.
5. If you are on social media, I cannot predict your political views
if you never once posted anything remotely political.
Multiple Choice Questions
1. Many psychology experiments are conducted on university
campus by academic researchers. The human subjects recruited
tend to be college students, who are generally younger and
smarter than the population as a whole. This is clearly not a
representative sample of the general population. To show
universal validity of an important effect, researchers went off-
campus to a nearby city and recruited volunteer subjects by
offering a small cash incentive for their time.
a. The second experiment is a good random sample of the
population.
b. The second experiment is not a good random sample
either, and so is pointless.
c. The second experiment is not a good random sample, but
is still valuable.
2. Looking at Yelp for reviews, you find a review for a restaurant
that gives it one star but praises it effusively. What is the most
likely explanation?
a. The user had some major complaint that you are not
seeing in the text
b. The user was just confused about the scale and meant to
give 5 stars
c. There is some bug in Yelp software that caused this to
happen
3. Your company has a promotion it intends to run to attract new
customers. To understand if this is correctly tuned, they send a
survey to current customers. Are the survey results likely
useful?
a. Yes
b. No
c. Can’t say
Activity
Discuss personal privacy options of a popular social media and what
users would perceive as a violation. For example, think about what
others could learn about you from the posts on your wall by merging it
with other information.
Summary
Privacy
It is difficult to balance transparency of data against protection of data;
there has to be a middle ground between collecting all of the data and
collecting none of it. Because big data is everywhere, the associated problem
is that it remembers everything, and the intent is usually to keep this
information forever. Data sources can now collect huge amounts of information,
and such information has been used to infringe, or potentially damage, the
privacy of the people it describes. Another thing to keep in mind about
privacy is that sometimes the decisions are not necessarily made by you. It is
difficult to determine how best to protect data and maintain privacy. As a
data scientist, when you work with potentially sensitive data, you need to
determine what data you can use and how best to protect it. The first
preference is not to use sensitive data at all; where you do need to keep
sensitive data and use its real values, try to minimize the data you retain.
If you do not need to keep the true sensitive values, you can use techniques
such as homomorphic pseudonymization or synthetic data to keep valid-looking
values in your data set without revealing the true values. If you still need
to use private data for testing or research, the next step is to examine how
to securely share the data, the analysis or the model. Finally, you may be
working on a model or analysis intended for internal use only that becomes
popular and that the company then wants to publish or share with others. By
creating regulatory-compliant, transparent, fair and secure practices for
handling sensitive data, we help ensure that data science can deliver
efficient, personalized, automated solutions.
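As an illustration of the pseudonymization idea mentioned above, here is a minimal sketch, assuming a keyed hash is acceptable as the pseudonym; the column names and secret key are hypothetical.

import hmac
import hashlib

SECRET_KEY = b"replace-with-a-securely-stored-key"  # hypothetical key

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a keyed hash (not reversible without the key)."""
    return hmac.new(SECRET_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()

record = {"email": "alice@example.com", "purchase_amount": 42.50}
safe_record = {"user_pseudonym": pseudonymize(record["email"]),
               "purchase_amount": record["purchase_amount"]}
print(safe_record)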
Anonymity
De-identification is the removal, from data, of information that could be
used for identification. The point is that once we have done a
de-identification, all it means is that the person's identity is not
immediately obvious; it could still possibly be determined. For example, even
if you have removed the labels that say which node in a network is which
person, matching the structure of one graph against the structure of another
can reveal identities. The simplest method of anonymizing data is to simply
delete personal data or replace it with a pseudonym or artificial identifier.
This can be reversible, if there is a look-up table to convert the pseudonyms
back to the original data, or irreversible, if the data is completely removed.
There is enough other data in the world to defeat de-identification.
Anonymization is intended to guarantee the protection of the data: it means
that we cannot figure out who the person is from the rest of the data.
Anonymization is one of the best ways to keep collected information safe, and
with this additional safeguard your data can often be used freely in ways that
the law does not permit for non-anonymized data. However, using personal data
in its raw form also offers some significant advantages, which is why we need
to think through the pros and cons of each option before making a final
decision. Regardless of the method chosen, it remains of paramount importance
to store the data in a secure environment.
Data validity
Validity is a measure of whether or not the data we have available actually
support the conclusion we draw from it. It is important to remember that
validity relates to the causal mechanisms between the observed "A" and the
observed "B" in the sample or population studied. The distinction between the
data we have and the data we want is important to keep track of; when we do
not have exactly the data we want, statistical techniques exist to reweight
the samples so that at least the important attributes are balanced. Training
data is also a problem when a system is applied to a future population that
differs from the past population it was trained on. We want data sources that
are authoritative, complete and timely, data sources that provide good data
and provide access to the subject of that data. We should be careful when
attributing results derived from aggregated group data to individuals.
Finally, it is important that we carefully check the validity of our data and
the model.
Keywords
Pseudonymization: When data is pseudonymized, the
information that can point to the identity of a subject is replaced
by “pseudonyms” or identifiers. This prevents the data from
specifically pinpointing the user.
Anonymization: Anonymization is a data processing technique
that removes or modifies personally identifiable information; it
results in anonymized data that cannot be associated with any
one individual.
Data Validation: Data validation means checking the accuracy
and quality of source data before using, importing or otherwise
processing data.
Self Assessment Questions
1. Which are the methods to deal with challenges in Data Privacy?
2. What are the key benefits of Data Anonymization?
3. Describe in detail any one technique of Data Anonymization.
4. What is Data validity?
Answers to Check your Progress
Fill in the blanks
1. Without data collection and the free flow of information, data
science would not exist.
2. K anonymity protects identification attacks by creating buckets
that contain a single person in a group of K people.
3. It is advisable to use a combination of several anonymizations
at the same time in order to prevent your data set from being
identified again.
4. Most machine learning only estimates parameters to fit a
predetermined model.
True or False
1. True
2. False
3. True
4. True
5. False
Multiple Choice Questions
1. c
2. b
3. b
References
1. https://towardsdatascience.com/ai-ml-and-data-analytics-in-the-age-of-privacy-regulations-2b79447d5239
2. https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-general-data-protection-regulation-gdpr/what-is-personal-data/what-is-personal-data/
Algorithmic Fairness
UNIT
6
Structure:
6.1 Introduction
6.2 Correlated attributes, Misleading but correct results and P-hacking
6.3 Definitions of Fairness
6.4 Potential causes of Unfairness
6.5 Other Methods to Increase Fairness
6.6 Final Comments
Summary
Keywords
Self Assessment Questions
Answers to Check your Progress
References
Objective:
After going through this unit, you will be able to
Recognize factors that lead to bias in algorithms
Learn about the methods to reduce these biases.
6.1 INTRODUCTION
Today, an increasing number of decisions are controlled by artificial
intelligence (AI) algorithms, with automated decision-making systems
increasingly being implemented in business and government
applications. We expect algorithms to perform better than human
beings for several reasons:
1. Algorithms can incorporate a lot more data than a human can
capture and take into account a lot more considerations.
2. Algorithms can perform complex calculations much faster than
humans.
3. Human decisions are subjective and often associated with
prejudice.
Therefore, there is a common perception that decisions are more objective when
made by automated algorithms. Unfortunately, this is not always the case,
because AI algorithms are not always as objective as expected. Since many
automated decisions can have a significant impact on people's lives,
evaluating and improving the integrity of decisions made by these automated
systems is of paramount importance. Indeed, in recent years concerns about
algorithmic fairness have made headlines. For example, when hiring people for
software development and technical positions, Amazon found that its AI hiring
system discriminated against female candidates; one suspected reason is that
most of the historical data recorded was for male software developers. In
another case, in advertising, Google's ad targeting algorithm was found to
suggest higher-paying executive jobs to men more often than to women. This
evidence, and the resulting concern about algorithmic objectivity, has led to
growing interest in the literature on how to define, evaluate, and improve
fairness in AI algorithms. However, it is important to note that the task of
improving the fairness of an AI algorithm is not trivial, because there is an
underlying trade-off between accuracy and fairness.
It turns out that algorithms can be biased, and there are many ways in which
this bias can arise. You may get biases because the training data set is not
representative of the population, or because the past population is not
representative of the future population. Consider a simple example. A
company's workforce is only 20% female, and the company has a boys' club
culture that makes it difficult for women to succeed. Since hiring algorithms
are trained on current statistics and on the success of current employees,
such an algorithm will underrate female candidates. The algorithm fairly
accurately represents what is happening today: it is quite true that women
have difficulty succeeding at this company because of the boys' club culture.
The net effect, however, is that the algorithm rates female applicants lower,
and so the company hires even fewer women. This is a vicious algorithmic
cycle. It arose from a bias in the algorithm, and the bias arose in the
algorithm because of the data it was trained on.
6.2 CORRELATED ATTRIBUTES, MISLEADING BUT
CORRECT RESULTS AND P-HACKING
We will discuss three different things.
1. A notion of correlated attributes.
2. A notion of correct but misleading results.
3. P-hacking.
Correlated attributes and correct but misleading results:
Caste discrimination is a major issue in India. Laws have been
enforced to reduce the discrimination. For example, some universities
have reserved seats for students belonging to particular categories. As
a result of this, a 96% general category student may be denied
admission in prestigious college but a 75% reserved category student
may be granted admission in the same institute. Better would be
completely taking out the reservation and simply taking the top ten
percent of the class in the said state. The point is, big data provides the
technology to enable such proxy discrimination. And the salvation is
144
that Big Data also provides the technology to identify and combat
such discrimination. Hence, it is up to data scientists, to make sure we
are harnessing the power of technology to do the right thing. One of
the things we want to do is stop unintentional discrimination. Intent is
important when thinking about discrimination. And you have to be
clear about what kind of discrimination you want to avoid.
What we would like to call discrimination is when a person from the target
group is treated differently than an otherwise identical person who is not
from that group. When a Black applicant and a white applicant with identical
qualifications apply, and one of them is called back for an interview while
the other is not, we can all agree that this is discrimination; but that is
discrimination at the individual level. Another thing to keep in mind is that
unintentional discrimination is difficult to avoid. It happens, and we need to
use our data analysis techniques to avoid it better. One experiment showed
that significantly fewer women than men were shown online ads for high-paying
jobs. This is likely because the ads shown were chosen based on recent past
click-through rates, and men had probably clicked these high-paying-job ads
more often than women. So, again, this was not deliberate discrimination on
the part of the company in its choice of advertisements, but algorithmic
discrimination took place because the algorithm perpetuated the status quo.
Variables are often correlated, and one should not be surprised to find that
while thinking one is dealing with a single variable of interest, one is also
indirectly dealing with another variable that one did not even know was
involved. A related problem for fairness is when we get results that are
correct but misleading or unfair in some way. The question then is how to
visualize or present the results. Consider a rating system in a food delivery
service like Swiggy or Zomato, where we look at user reviews and use them to
choose a restaurant. These services usually give a rating between 1 and 5.
Suppose we have two restaurants. Restaurant A gets an average rating of 3.2 (a
mix of mostly 3s and 4s). Restaurant B also gets an average of 3.2 (a mix of
mostly 1s and 5s). Which restaurant would you prefer? Many reputation websites
only show the average, which is what restaurants are ranked and sorted by, and
that important difference between the two restaurants is masked if you do not
dig into it. The point is that a restaurant like Restaurant B, which people
either love or hate, is either exactly right for you or one you will avoid by
a mile, depending on whether you are more like the people who rated it 5 or
more like the people who rated it 1. Calling it a 3.2 is not right, because it
is effectively either a 5 or a 1 for you depending on who you are, and
figuring out which of the two you are is not easy without extra work.
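A minimal sketch of the point about identical averages hiding very different rating distributions; the rating lists below are made up purely for illustration.

from statistics import mean, stdev

# Hypothetical rating lists with the same average.
restaurant_a = [3, 3, 4, 3, 4, 3, 3, 4, 3, 2]   # mostly 3s and 4s
restaurant_b = [1, 5, 5, 1, 5, 1, 5, 1, 5, 3]   # mostly 1s and 5s

for name, ratings in [("A", restaurant_a), ("B", restaurant_b)]:
    print(name, "mean =", round(mean(ratings), 2),
          "spread (stdev) =", round(stdev(ratings), 2))
# Both means are 3.2, but the spread exposes the love-it-or-hate-it pattern.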
Single summary scores are a problem for another reason. Suppose Hotel X has an
average score of 4.5 based on 2 user reviews, and Hotel Y has an average of
4.4 based on 200 user reviews. Which would you prefer? 4.4 is less than 4.5,
so in terms of sort order 4.5 could come first. However, on most websites it
is all too easy to leave a few fake positive reviews, so we should be
suspicious of that 4.5, while the 4.4 with 200 reviews seems like a much more
solid number to rely on. Now take a somewhat similar example: Hotel A gets an
average of 4.2 with 10 ratings and Hotel B gets an average of 4.0 with 500
ratings. You might prefer Hotel B. But if you also know that Hotel A has only
5 rooms while Hotel B has 500 rooms, does that affect your decision? Since
Hotel A has fewer customers, you should expect it to have fewer reviews; the
fact that it has fewer reviews should not be held against it just because it
is so much smaller than Hotel B.
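One common way to compare ratings backed by very different review counts is a smoothed (Bayesian) average that pulls small samples toward an overall prior. This is not a method described in the text, just an illustrative sketch; the prior mean and prior strength are assumptions.

def smoothed_rating(ratings_sum, ratings_count, prior_mean=3.5, prior_strength=20):
    """Average pulled toward a prior; small samples move less far from the prior."""
    return (ratings_sum + prior_strength * prior_mean) / (ratings_count + prior_strength)

# Hotel X: 4.5 average over 2 reviews; Hotel Y: 4.4 average over 200 reviews.
print("Hotel X:", round(smoothed_rating(4.5 * 2, 2), 2))
print("Hotel Y:", round(smoothed_rating(4.4 * 200, 200), 2))
# The 200-review hotel now ranks ahead of the 2-review hotel.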
Oftentimes, if you want to report results responsibly, you have to get to the
heart of what drives them; that is part of completing the analysis. Suppose we
have a hiring algorithm that examines certain attributes and scores candidates
on them, and the criteria this algorithm uses were chosen to fit the majority
group. If the optimal criteria would have been different for some minority
group, the algorithm will do poorly for that minority, and because it works
badly for them we get several problems. First, the best minority applicants
are not recruited: a very skilled minority applicant who would actually do
very well is rejected because our algorithm is badly tuned. Second, the
minority employees who are hired are no longer the best, due to that same poor
tuning, and because they are not the best they do not perform as well in the
end. As a result, they unjustly prejudice the prospects of others in the
minority, because the algorithm will now learn that minority hires do not do
well. So we first suppressed diversity and in the end discriminated against
it.
P-Hacking
What does a P-value mean? In general, there is a population in the world, and
the population has some distribution. Often the distribution you see is a
Gaussian, or bell-shaped, curve. For example, if we look at people's heights
we have such a distribution: most people are of medium height, some are very
tall, and some are very short. Now suppose someone has a magical tonic that
they claim makes people taller, so we run an experiment. The null hypothesis
is that the tonic is useless. We give someone the tonic (in a real experiment
there would be a control group and a treatment group, but for now assume a
single person, as it is simpler), and we watch the child who was given the
tonic grow up and see what height that person reaches. On the height
distribution we draw a cut-off line corresponding to our chosen P-value
threshold; the typical threshold is 0.05, meaning the region to the right of
the line covers five percent, or 0.05, of the total area under the curve. Then
we check whether the treated person's height lies to the left or to the right
of this line. If it lies to the left, the height could easily have come from
the normal population by chance, so the tonic is probably not effective. If it
lies to the right, the person may still not have benefited from the tonic, but
the probability that such a height would occur by chance is less than five
percent; in that case we call the effect significant and conclude that the
tonic was probably valuable.
The calculation of P-values is based on many assumptions. First, is the
distribution nicely bell-shaped? For many real-world phenomena things are
indeed bell-shaped, but for others they are not. The theory of P-values can be
applied regardless of the shape of the distribution, but when the shape is
different the mathematics changes, and we are often not careful enough about
what the distribution actually looks like and where the P-value cut-off should
be. The other standard problem is testing multiple hypotheses. Suppose I have
a hundred different candidate tonics and test each one individually. Each of
these remedies has a five percent chance of being declared significant even if
it did no good at all. So when you test a hundred different independent
tonics, you should expect on average five of the hundred to meet the
significance criterion. If you had tested a hundred useless tonics that did
nothing to change a child's height, you would still have found five that you
could claim, after a scientific test, were effective. That may sound
contrived, but it is exactly the situation faced in high-throughput biology
(see the simulation sketch below).
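A minimal simulation of the multiple-testing problem described above: one hundred useless "tonics", each tested at the 0.05 level, still yield about five spurious "significant" results. The sample sizes and the use of a t-test are assumptions made only for illustration.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
false_positives = 0

for _ in range(100):                        # one hundred useless tonics
    control = rng.normal(170, 10, size=30)  # heights in cm, no real effect
    treated = rng.normal(170, 10, size=30)  # same distribution: tonic does nothing
    _, p_value = stats.ttest_ind(treated, control)
    if p_value < 0.05:
        false_positives += 1

print("Tonics declared 'significant' despite doing nothing:", false_positives)
# Expected value is about 5 out of 100.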
Often, we observe the data first and then hypothesize. Standard P-value
mathematics was developed for traditional experimental settings, where you
design the experiment first and then collect the data: first you have your
hypotheses and then you collect data specifically to test them. In data
science we have the data first, and often we do not even have a hypothesis in
mind; we want to rummage through the data and learn from it. In fact, there is
a lot that can be learned from exploratory analysis, and in today's data
science world exploration is the first phase of data analysis. But this means
that we are now forming hypotheses to fit the data we have already observed,
and of course a hypothesis chosen after seeing the data will fit that same
data well when tested against it.
6.3 DEFINITIONS OF FAIRNESS
In this section we analyze some of the fairness definitions proposed by
fairness researchers in machine learning: statistical parity, and then more
nuanced variants of statistical parity such as equality of opportunity and
equalized odds.
Statistical Parity
The simplest and oldest method of enforcing fairness is statistical parity.
The formula for statistical parity: P(ŷ | p) = P (ŷ)
In words, this says that the prediction ŷ is independent of the protected
attribute p: group membership has no influence on the probability of the
predicted outcome.
Under statistical parity, the outcome is independent of group membership,
meaning that the same proportion of each group is classified as positive (or
negative). For this reason statistical parity is also called demographic
parity. For datasets where statistical parity does not hold, we can measure
how far a prediction deviates from it using the statistical parity distance
(SPD):
SPD = P(ŷ = 1 | p = 1) − P(ŷ = 1 | p = 0)
The statistical parity distance measures the extent to which a prediction
deviates from statistical parity. This gap provides a metric for how fair or
unfair a given dataset or prediction is with respect to the protected group
attribute p.
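A minimal sketch of computing the statistical parity difference from binary predictions and a binary protected attribute; the example arrays are invented for illustration.

import numpy as np

def statistical_parity_difference(y_pred, protected):
    """SPD = P(y_pred = 1 | protected = 1) - P(y_pred = 1 | protected = 0)."""
    y_pred = np.asarray(y_pred)
    protected = np.asarray(protected)
    rate_protected = y_pred[protected == 1].mean()
    rate_unprotected = y_pred[protected == 0].mean()
    return rate_protected - rate_unprotected

# Hypothetical predictions (1 = positive decision) and group membership.
y_pred    = [1, 0, 1, 1, 0, 0, 1, 0]
protected = [1, 1, 1, 1, 0, 0, 0, 0]
print("SPD:", statistical_parity_difference(y_pred, protected))  # 0.75 - 0.25 = 0.5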
Some dilemmas of statistical parity:
1. Statistical parity does not guarantee fairness.
Statistical parity says nothing about the accuracy of the predictions. If one
group is genuinely more likely to have positive outcomes than the other,
forcing the same proportion of positive predictions in each group can produce
large differences between the true positive and false positive rates of the
two groups. This can itself have a disparate impact, as qualified people from
one group (p = 0) may be overlooked in favor of unqualified people from
another group (p = 1). In this sense, statistical parity is closer to equality
of outcomes.
2. Statistical parity reduces algorithmic accuracy
A protected class can provide some information that is useful in
making predictions. However, we cannot use this information due to
the strict rules of statistical parity. For example, gender can be very
informative in making predictions about items people might buy.
However, if we are prevented from using it, our model will become
weaker and its accuracy will be compromised. A better method would allow us to
account for the differences between these groups without producing a disparate
impact. It is clear that statistical parity does not align with the basic goal
of accuracy in machine learning: even a perfect classifier may not satisfy
demographic parity. For these reasons, several fairness researchers no longer
see statistical parity as a credible option for machine learning.
There are slightly more nuanced versions of statistical parity, such as
true positive parity, false positive parity, and positive rate parity.
True positive parity
This is only possible for binary predictions and does statistical parity
on true positives (the prediction output was 1 and the true output was
also 1).
Formula: P(ŷ|y = 1,p) = P(ŷ|y = 1)
It ensures that, within each group, an equal proportion of the truly qualified
persons (y = 1) is classified as qualified (ŷ = 1). This is useful when we are
only interested in parity over the positive outcome.
False Positive Parity
This only applies to binary predictions and focuses on false positives
(the prediction output was 1 but the true output was 0). This is
analogous to the true positive rate, but instead offers parity over false
positives.
Positive Rate Parity (Equalized Odds)
This requires parity for both true positives and false positives at the same
time, and is also known as equalized odds.
Note that equality of opportunity is a relaxation of equalized odds: it drops
the requirement that the rates also be equal in the case y = 0 and insists on
parity only for y = 1. Equality of opportunity and equalized odds are more
flexible than statistical parity and can incorporate some of the information
carried by the protected variable without producing a disparate impact.
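A minimal sketch of how true positive and false positive rates might be compared across groups to check equalized odds; the labels, predictions, and group assignments are invented.

import numpy as np

def rates_by_group(y_true, y_pred, protected):
    """Per-group true positive rate and false positive rate (equalized odds check)."""
    y_true, y_pred, protected = map(np.asarray, (y_true, y_pred, protected))
    result = {}
    for g in np.unique(protected):
        mask = protected == g
        tpr = y_pred[mask & (y_true == 1)].mean()  # P(ŷ=1 | y=1, p=g)
        fpr = y_pred[mask & (y_true == 0)].mean()  # P(ŷ=1 | y=0, p=g)
        result[int(g)] = {"TPR": round(float(tpr), 2), "FPR": round(float(fpr), 2)}
    return result

# Hypothetical labels, predictions and group membership.
y_true    = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred    = [1, 0, 1, 0, 1, 1, 0, 0]
protected = [1, 1, 1, 1, 0, 0, 0, 0]
print(rates_by_group(y_true, y_pred, protected))
# Equalized odds would require the TPR values (and the FPR values) to match across groups.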
Note that while all of these are solutions that can be considered fair,
none of them are particularly satisfactory. One reason for this is that
there are many conflicting definitions of what fairness means and it is
difficult to capture them in algorithmic form. These are good starting
points, but there is still a lot of room for improvement.
6.4 POTENTIAL CAUSES OF UNFAIRNESS
Several causes have been identified in the literature that can lead to
inequalities in machine learning:
• The datasets used for learning already contain biases, due to biased
measurement devices, historically biased human decisions, erroneous reports,
or other reasons. Machine learning algorithms are designed to reproduce the
patterns in their training data, and so they replicate these biases as well.
• Distortions caused by missing data, e.g. missing values or sample /
selection bias lead to data sets that are not representative of the target
population.
• Distortions that result from algorithmic goals that aim to minimize
aggregated forecast errors overall and therefore benefit majority
groups over minorities.
• Distortions caused by “proxy” attributes for sensitive attributes.
Sensitive attributes differentiate between privileged and non-
privileged groups such as race, gender and age and are usually not
legitimate for decision-making. Proxy attributes are non-confidential
attributes that can be used to derive confidential attributes. In the
event that the data set contains proxy attributes, the machine learning
algorithm can make implicit decisions based on the confidential
attributes, under the guise of using presumably legitimate attributes.
6.5 OTHER METHODS TO INCREASE FAIRNESS
Statistical parity, balanced odds, and equal opportunities are good
places to start, but we can take other measures to ensure algorithms
are not used to improperly discriminate against people. Two such
solutions that have been proposed are human-in-the-loop and
algorithmic transparency.
Human-in-the-Loop
It refers to the paradigm by which humans monitor the algorithm
process. Human-in-the-Loop is often implemented in situations where
there is a high risk if the algorithm makes a mistake. For example,
missile detection systems that notify the military when a missile is
detected allow individuals to review the situation and decide how to
respond - the algorithm will not respond without human interaction.
Imagine the disastrous consequences of operating nuclear weapon
systems with AI that were allowed to fire when they detected a threat -
a false positive and the whole world would be doomed.
Another important and similar concept is Human-on-the-Loop. Unlike
Human-in-the-Loop, in Human-on-the-Loop, humans are passively
involved in monitoring the algorithm. For example, a data analyst may
be responsible for monitoring sections of an oil and gas pipeline to
ensure that all sensors and processes are functioning properly and that
there are no signals or errors. This analyst is in a supervisory position
but is not actively involved in the process.
Algorithmic Transparency
The argument is that if an algorithm can be viewed publicly and carefully
analyzed, it can be assured with a high degree of certainty that no disparate
impact is built into the model. While this is clearly desirable on many
levels, algorithmic transparency has some drawbacks.
From a commercial perspective, in most cases, this idea is untenable -
trade secrets or proprietary information can be lost when algorithms
and business processes are visible to everyone. Imagine if Facebook were asked
to release its algorithms to the world so that they could be checked for bias
issues: anyone could download the code and easily start their own version of
Facebook.
Full transparency is really just an option for algorithms used in public
services such as health care, government (to some extent), legal
system, etc.
Going forward, algorithmic fairness rules may be a more durable
solution than algorithmic transparency for private companies that have
a vested interest in protecting their algorithms from the public.
Algorithms could be submitted to a regulator, or possibly to third-party audit
services, and analyzed to ensure that they are suitable for use without
creating a disparate impact.
There is still a long way to go to ensure that our algorithms are free
from all kinds of biases. With a combination of transparency,
regulations, human-in-the-loop, human-on-the-loop and new and
improved variations in statistical parity, we are part of the way there,
but this field is still young and there is much still to be done.
6.6 FINAL COMMENTS
Some may point out that, without algorithms, these decisions are made by
humans on the basis of less information, with plenty of intuition and
cognitive bias affecting the outcome. Automated decisions can provide more
accurate results and largely limit those human biases; the algorithm does not
have to be perfect, only better than what previously existed. Although machine
learning, by its very nature, is always a form of statistical discrimination,
discrimination becomes objectionable when privileged groups are given a
systematic advantage and non-privileged groups a systematic disadvantage. Bias
in labels, or under- and over-sampling, leads to unwanted bias in the trained
models.
We have extensively discussed several biases present in training data
due to the way it is collected and analyzed. We also discussed various
ways to mitigate the effects of these biases and ensure that algorithms
remain non-discriminatory towards minority groups and protected
classes.
The details of data collection and sampling procedures are often not well
understood by the general public. Until regulatory bodies step in, it is up to
machine learning engineers, statisticians and data scientists to make sure
that equality of opportunity is built into our machine learning methods. We
need to be careful about where our data comes from and what we do with it.
Some may say that the algorithm is being given free rein to systematically
instantiate existing inequalities, or that the data itself is inherently
biased. To help reduce this problem, variables related to protected attributes
should be removed from the data, and variables correlated with them should be
removed or controlled for. However, we should not be satisfied with inadequate
algorithms; there is room for improvement. Likewise, we should not throw away
all the data we have and remove every variable, as this will make the system
perform poorly and make it less useful. That being said, at the end of the day
it is up to the creators of these algorithms, the organizations that monitor
them, and those in charge of data collection to ensure that these biases are
handled properly.
To conclude, humans have many prejudices; no person is perfect, however good
their intentions. Psychologists have done a remarkable job of explaining the
many unconscious prejudices we all have. There are, however, mathematical
definitions of fairness that can be applied in most cases. We can demonstrate
a degree of fairness by making explicit assumptions in the algorithm, and we
can try to make the algorithm as impartial as possible by detecting the kinds
of errors discussed above. The point is that we need to recognize that
algorithms can also have biases, and that they can reflect the biases of the
people who build them. With that in mind, we can reduce the bias that an
algorithm actually exhibits.
Check your Progress
Fill in the blanks
1. Big data provides the technology to enable _____
discrimination.
2. Most common type of distribution is a _________ ____
_________________ curve.
3. Simplest and oldest method of enforcing fairness is
________________ _________.
4. ___________ _____________ ____________ can be used to
measure the extent to which a prediction deviates from
statistical parity.
5. _______________ ________________ differentiate between
privileged and non-privileged groups
True or False
1. The criteria that this algorithm uses were chosen in accordance
to the majority.
2. Statistical parity guarantees fairness.
3. False Positive Parity only applies to binary predictions and
focuses on false positives.
4. In Human-in-the-Loop, humans are passively involved in
monitoring the algorithm.
Multiple choice Questions
1. You have, at great cost, obtained 5 samples of tissue ravaged by
a rare disease, and measured the levels of gene expression for
10,000 genes in these 5 samples and in 5 additional samples of
healthy tissue as control. You find a list of 200 genes that are
differentially expressed, with a p-value of less than 0.01.
Approximately how many of these 200 genes would you expect
are differentially expressed due to random variation rather than
a real difference?
a. 2
b. 10
c. 100
d. 8
2. A university uses performance on a standardized test as the only
scoring mechanism used to admit applicants. The university
observes that it is admitting far fewer minority students than
their proportion in the population at large. Based on only these
facts, can we conclude that the test is unfair?
a. Yes
b. No
c. Cannot say
3. As part of a gender bias study, your university examines the
average grade obtained by women in STEM classes against the
average grade obtained by men. Given what you know about
aggregation bias, is this comparison meaningful?
a. Yes
b. No
c. Need more details to form an opinion
Activity
A high-end store applies facial recognition technology to recordings from its
video cameras, so that when you enter it can instantly identify whether you
have shopped there before. If your previous purchase history identifies you as
a high-value buyer, an assistant is immediately assigned to you during your
visit to help you select and find the items you want. Ordinary buyers who have
not been identified as high-value buyers do not receive the same service. Is
that unfair? Why or why not? Does your response change when high-value buyers
are identified not only by their previous purchases in this store but also by
their financial worth, as determined by the store through consultation with an
outside data broker?
Summary
Since many automated decisions can have a significant impact on people's
lives, evaluating and improving the integrity of decisions made by these
automated systems is of paramount importance. However, it is important to note
that the task of improving the fairness of an AI algorithm is not trivial,
because there is an underlying trade-off between accuracy and fairness. Often,
unfair outcomes arise from a bias in the algorithm, and the bias arose in the
algorithm because of the data it was trained on.
Correlated attributes, Misleading but correct results and P-
hacking
The salvation is that Big Data also provides the technology to identify
and combat such discrimination. Hence, it is up to data scientists, to
make sure we are harnessing the power of technology to do the right
thing. What we would like to consider discrimination is when a person
from the target audience is treated differently than an otherwise
identical person who is not from the target audience. When a black
and white person with identical qualifications and one of them is
called back and the other is not for an interview. When we have this
adjustment algorithm that examines some attributes and selects the
attributes based on which to adjust them. And the criteria that this
algorithm uses were chosen in accordance to the majority.
Definitions of Fairness
Statistical Parity
The formula for statistical parity: P(ŷ | p) = P (ŷ). For statistical parity,
the result is independent of group membership. This means that the
same proportion of each group is classified as positive or negative.
For datasets where statistical parity does not hold, we can measure how far a
prediction deviates from it using the statistical parity distance:
SPD = P(ŷ = 1 | p = 1) − P(ŷ = 1 | p = 0)
Some dilemmas of statistical parity:
1. Statistical parity does not guarantee fairness.
If one group is genuinely more likely to have positive outcomes than the
other, enforcing equal prediction rates can produce large differences between
the true positive and false positive rates for the two groups.
2. Statistical parity reduces algorithmic accuracy
A protected class can provide some information that is useful in
making predictions. However, we cannot use this information due to
the strict rules of statistical parity.
There are slightly more nuanced versions of statistical parity, such as
true positive parity, false positive parity, and positive rate parity.
Potential causes of unfairness
Several causes have been identified in the literature that can lead to
inequalities in machine learning:
• The datasets used for learning already have biases due to biased
device measurements, historically biased human decisions, erroneous
reports, or other reasons.
• Distortions caused by missing data, e.g. missing values or sample /
selection bias lead to data sets that are not representative of the target
population.
• Distortions that result from algorithmic goals that aim to minimize
aggregated forecast errors overall and therefore benefit majority
groups over minorities.
• Distortions caused by “proxy” attributes for sensitive attributes.
Proxy attributes are non-confidential attributes that can be used to
derive confidential attributes.
Other Methods to increase fairness:
Human-in-the-loop
It refers to the paradigm by which humans monitor the algorithm
process. Human-in-the-Loop is often implemented in situations where
there is a high risk if the algorithm makes a mistake.
Algorithmic Transparency
The argument is that if an algorithm can be viewed publicly and
carefully analyzed, it can be assured with a high degree of certainty
that no different effects are built into the model.
To conclude, humans have many prejudices. Psychologists have done
a remarkable job of explaining to us the many unconscious prejudices
we all have. However, there are mathematical definitions of fairness
that can be applied in most cases. The point is that we need to recognize that
algorithms can also have biases and that they can reflect the biases of the
people who build them. With that in mind, we can reduce the bias an algorithm
actually exhibits.
Keywords
Data attribute: Attribute data is data that have a quality
characteristic (or attribute) that meets or does not meet product
specification.
Hypothesis: A supposition or proposed explanation made on
the basis of limited evidence as a starting point for further
investigation.
Distortions: The act of twisting or altering something out of its
true, natural, or original state
Self Assessment Questions
1. What is p-hacking?
2. Describe any one method to increase fairness
3. What are the potential causes of unfairness?
Answers to Check your Progress
Fill in the blanks
1. Big data provides the technology to enable proxy
discrimination.
2. Most common type of distribution is a Gaussian or bell-shaped
curve.
3. Simplest and oldest method of enforcing fairness is statistical
parity.
4. Statistical parity distances can be used to measure the extent to
which a prediction deviates from statistical parity.
5. Sensitive attributes differentiate between privileged and non-
privileged groups
True or False
1. True
2. False
3. True
4. False
Multiple Choice Questions
1. c
2. b
3. a
References
1. https://towardsdatascience.com/a-tutorial-on-fairness-in-machine-learning-3ff8ba1040cb
Societal
Consequences
UNIT
7
Structure:
7.1 Introduction
7.2 Distributional Unfairness
7.3 Ossification
7.4 Surveillance
7.5 Asymmetry
7.6 Other Impacts
Summary
Keywords
Self Assessment Questions
Answers to Check your Progress
References
Objective:
After going through this unit, you will be able to
Understand a few aspects of data science that have large societal
impacts, with the help of various examples.
Brainstorm ideas that may help avert the effects of biased
algorithms.
7.1 INTRODUCTION
Data evaluation strategies vary depending on the application and raise varied
societal challenges. On the one hand, ubiquitous services such as Twitter or
Facebook, with their predefined filter algorithms, contribute significantly to
how communication and social life are structured. On the other hand, digital
infrastructures open up expanded possibilities for observation and control, as
the profiles of their users can be assessed and sanctioned much more
efficiently than before.
Oftentimes, engineers and data scientists focus on developing products that
help solve an applied problem, and typically assume there is a "ground truth"
against which to train models and align their solutions. In contrast, social
scientists try to place current societal dynamics in context, within the
company, within broader socio-economic developments, and within long-term
social transformation processes, in order to explain why something is
happening. And the larger the data sets, the greater the likelihood of
discovering coincidental patterns and spurious correlations by chance.
Despite local concerns, big data strategies have certainly helped
improve services in certain economic sectors over the years. In the
healthcare sector in particular, big data has been increasingly used to
reduce the increasing costs associated with the provision of medical
services. Within the financial sector, big data certainly helps to
prevent or at times reduce fraudulent financial transactions and to
protect vulnerable consumers from identity theft. There is also
growing evidence that big data strategies are being used to reduce
service disruptions in the power and Internet service delivery
industries, with obvious benefits for consumers who increasingly rely
on these services.
Even if data analysis is done in a fair way, the analysis is valid, the data
is valid, there are no mistakes, there is no intention to discriminate, and
there are no privacy issues, there can still be an impact on society in ways
that are not expected.
The four topics we will discuss are:
Distributional unfairness
Ossification
Surveillance
Asymmetry
7.2 DISTRIBUTIONAL UNFAIRNESS
Let us begin with an example. We all hate potholes. A few years ago some
people realized that cell phones have accelerometers, which can detect when a
car has run over a pothole, and that cell phones also have GPS. The idea was
that if you had the Street Bump app running in your car, then every time you
hit a pothole it would immediately report the exact location to the city. This
clever idea was deployed in a city: doing something for a good cause, what
could possibly go wrong? One issue the city government recognized even before
making the app available is that with this type of crowd-sourced coverage,
reports concentrate on the streets used by people who are well off enough to
own a car and a smartphone. So the city actively compensated for this by
having city workers drive around poorer areas in city vehicles, just to make
sure potholes were reported in all parts of the city. This is a positive
example of someone proactively thinking about the disparate impact of an
innovative technology.
So when doing data analysis of any kind, one has to think about the impact it
could have on particular social groups. Unfortunately, all too often this kind
of impact assessment is far removed from practitioners' everyday concerns.
Data scientists develop the algorithm; they are concerned about their
algorithms doing the right things, and that is hard enough. When that is the
focus, the notion of disparate impact feels too far removed. Moreover,
different demographic groups have different views about technology and, for
example, about privacy. Sociologists have found that these differences are
significant and uncomfortable. Therefore, we should not assume that the values
we have as individuals are reflected in all parts of society: people's
personal experiences with technology, and the ways certain technologies have
been used, can cause them to react in very different ways.
Let us look at another example. When you travel internationally and enter a
country, you are subject to customs inspection, whose aim is to prevent
smuggling. We know that most travelers are not smugglers, and that searching
is inconvenient, time consuming and expensive. Customs do not have enough
staff to search every traveler and every piece of luggage, so most travelers
are not searched; customs has the right to search everyone, but it does not.
The question is, how are travelers chosen for search? Naturally, customs will
look for the travelers most likely to be smugglers, even though even the most
likely smugglers are, in absolute terms, not very likely to be smuggling. The
interesting question is how they decide who is most likely to be a smuggler.
Presumably an experienced customs officer has an antenna for someone who is
acting suspiciously or has unusual luggage. But suppose for a moment that an
algorithm does this. What if the algorithm chooses based on various
characteristics, but its choice turns out to be driven mainly by country of
origin? Then travelers from certain "selected" countries are likely to be
stopped for a customs search while travelers from other countries are not.
What will mostly happen is that travelers from such a country feel
discriminated against because they are stopped and searched all the time. They
will share stories of customs harassment on social media, and now a segment of
the population feels that something is wrong with customs control, and perhaps
with the nation as a whole. And note that all of this assumes a possibly not
inappropriate, data-driven decision made by an algorithm.
One way to think about this is to compare it to stereotypes. Humans form
stereotypes because our brains have a lot to do: there is usually a grain of
truth behind a stereotype, and the brain simply jumps to conclusions. We know
objectively that this is completely unfair to the person being stereotyped; we
are projecting onto that individual our opinion about a group to which they
belong. When there is someone we do not know much about, we just use the
stereotype to guess what they might be like. Algorithms function similarly.
They use the various attributes they know about a person to classify them, for
example as a potential smuggler. And if that classification is based on a
person's membership in a group, that is, on the value of some attribute, then
the algorithm is doing exactly what we asked of it. In effect we are doing the
same thing we do with stereotypes, except that we now have a measurable,
seemingly objective basis for creating the stereotype.
This type of problem arises when we think about using data science to solve
problems where we will actually arrest someone, or prevent someone from doing
something, in a way that severely limits their freedom. If we want to go
there, we have to realize that the prediction is probabilistic: it only
indicates a probability, and at most it suggests where to apply stronger
surveillance. Even if you overcome this probability problem, there is another
problem with things like predictive policing. Suppose police forces are
deployed more heavily in areas with higher crime rates. They are not actually
arresting anyone in advance; they are just deploying police to particular
neighborhoods. One thing to realize, though, is that more surveillance leads
to more crime being detected in those neighborhoods.
In summary, we need to understand the consequences of data science. Our
predictions are probabilistic, so we know they will sometimes be wrong, and
knowing that, we need to understand the social cost of the mistakes we make.
We need to realize that the cost of failure is usually different for Type 1
and Type 2 errors: Type 1 mistakes, where we wrongly classify someone as
dangerous, criminal, or a bad prospect, versus Type 2 mistakes, where we
wrongly classify someone as safe, not criminal, or a perfectly good employee.
Given this asymmetry in the cost of errors, we need to incorporate it into the
algorithm itself and tune the algorithm to minimize the societal cost. For
example, when we build search engines there is a trade-off between recall, how
comprehensive the result set is, and precision, how much irrelevant material
is included, and search engines deliberately choose to be broad rather than
precise. The same idea can be applied in any classification algorithm. The
hard part is choosing the weights we put on the two types of error. We can
easily see that the costs are asymmetric, but the algorithm needs them to be
quantified, and quantification is difficult. Even so, although data scientists
will not know exactly what the weights should be, one can do a lot better than
simply assuming that they're equal.
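As a minimal sketch of building asymmetric error costs into a classifier, the example below chooses a decision threshold that minimizes a weighted cost rather than the plain error rate; the 5:1 cost ratio is an assumption made only for illustration, not a recommendation.

import numpy as np

def pick_threshold(scores, y_true, cost_fn=5.0, cost_fp=1.0):
    """Choose the score threshold minimizing the total weighted cost of errors."""
    scores, y_true = np.asarray(scores), np.asarray(y_true)
    best_t, best_cost = 0.5, float("inf")
    for t in np.linspace(0, 1, 101):
        y_pred = (scores >= t).astype(int)
        false_neg = np.sum((y_pred == 0) & (y_true == 1))  # wrongly classified as safe
        false_pos = np.sum((y_pred == 1) & (y_true == 0))  # wrongly flagged as risky
        cost = cost_fn * false_neg + cost_fp * false_pos
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t, best_cost

# Hypothetical model scores and true labels.
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.2, 0.9, 0.55]
y_true = [0,   0,   1,    1,   1,    0,   1,   0  ]
print(pick_threshold(scores, y_true))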
7.3 OSSIFICATION
Another societal problem to think about as a data scientist is ossification.
The idea is that people are selective about the websites they visit, the news
channels they watch, and the talk shows they listen to. This self-selection
surrounds us with others who hold similar views and reinforces our opinions,
so instead of hearing a multitude of opinions and then possibly changing our
minds, we simply reaffirm whatever our preconceptions are. In data science,
analytics can make it even harder for people to break out of stereotypes,
because those stereotypes get burned into the algorithms.
To see an example of this, let us look at what happens when you are hired by
an algorithm. Many companies use algorithms to rate applicants and then only
look at the applicants with the highest scores. These algorithms may be based
on what is known about applicants from their resumes, or on tests given to
them when they apply. Consider a talent-matching company with a proprietary
algorithm that recommends candidates based on what it knows about them. How
does it choose whom to recommend? Its goal is to find candidates who serve the
revealed interests of its client, the employer. From the algorithm's point of
view: if I have already shown you some candidates and you decided to interview
some of them, I get feedback. You did not interview this person I recommended;
that other person I recommended, you liked, interviewed, and then hired. Based
on this I tweak my algorithm so that I can give you better recommendations,
because you will appreciate my recommendations more if I tailor them to your
demonstrated preferences.
It all sounds very reasonable. One consequence, however, is that the tuned
recommendations of the talent-matching company now reflect whatever prejudices
exist in the employer's interests. In other words, the recommendation engine,
which is completely value-neutral in itself, has learned to respond to the
employer's prejudices, if there were any. If the employer changes its
guidelines, or a new hiring manager arrives, the recommendation engine still
has the prejudices of the past burned into it. Over time it can probably
correct itself, but we have added at least a significant delay and made it
harder for the employer to remove previous discrimination, because the
algorithm acts as a force to maintain the status quo (a small simulation of
this feedback loop is sketched below).
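A minimal, highly simplified simulation of the feedback loop just described, assuming a biased employer interviews group B candidates less often and the recommender retrains on those interview decisions; all numbers are invented to illustrate the dynamic, not to model any real system.

import numpy as np

rng = np.random.default_rng(1)
rec_rate = {"A": 0.5, "B": 0.5}          # recommender starts out neutral
employer_bias = {"A": 0.6, "B": 0.3}     # hypothetical: employer interviews B less often

for round_num in range(1, 6):
    interviews = {}
    for group in ("A", "B"):
        n_recommended = int(100 * rec_rate[group])
        # Employer decides whom to interview, with its own bias.
        interviews[group] = rng.binomial(n_recommended, employer_bias[group])
    total = interviews["A"] + interviews["B"]
    # Recommender "retrains": it recommends groups in proportion to past interviews.
    rec_rate = {g: interviews[g] / total for g in ("A", "B")}
    print(f"round {round_num}: share of recommendations going to B = {rec_rate['B']:.2f}")
# The share of group B recommendations shrinks round after round, even though
# the recommender itself has no explicit notion of the group at all.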
Think about the same problem from a different perspective. We know
that people have ingrained prejudices, many of them unconscious, and
that we act in biased ways even when we consciously try not to. We
also know that people tend to have similar people in their networks
and are more comfortable with similar people. So, if we rely on purely
human judgment for hiring instead of algorithms, it is far more likely
that our hires will not be diverse. Algorithms, by contrast, can be
written to explicitly overcome these distortions, so that they actually
lead to greater diversity. One just needs to make sure that this goal is
explicitly written into the algorithm design.
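As a hedged illustration of what "writing diversity into the algorithm design" might look like, the sketch below builds a shortlist by score but reserves a minimum share of slots for each group. The names, scores, and quota are invented and the rule is only one of many possible designs.

```python
# A minimal sketch: build a shortlist by score, but guarantee each group at least
# a minimum share of the slots. Groups, scores, and the quota are illustrative.

def shortlist(candidates, slots, min_share_per_group=0.3):
    """candidates: list of (name, group, score). Returns up to `slots` names."""
    groups = {c[1] for c in candidates}
    reserved = int(slots * min_share_per_group)
    chosen = []

    # First, fill the reserved slots for each group with its top-scoring candidates.
    for group in sorted(groups):
        pool = sorted((c for c in candidates if c[1] == group),
                      key=lambda c: c[2], reverse=True)
        chosen.extend(pool[:reserved])

    # Then fill the remaining slots purely by score.
    remaining = sorted((c for c in candidates if c not in chosen),
                       key=lambda c: c[2], reverse=True)
    chosen.extend(remaining[: slots - len(chosen)])
    return [c[0] for c in chosen[:slots]]

people = [("Asha", "A", 0.92), ("Ben", "A", 0.90), ("Chen", "A", 0.88),
          ("Eli", "A", 0.84), ("Dia", "B", 0.79), ("Fatima", "B", 0.75)]
print(shortlist(people, slots=4))
# ['Asha', 'Dia', 'Ben', 'Chen'] -- a pure score cut-off would have been all group A
```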
Let's look at another aspect of the hiring setting to see some of the
problems that can arise. Some companies have found that employees
who commute long hours are more likely to quit early. It's easy to see
why this could be so. If someone can find an alternative job with a
shorter commute, they will likely prefer it. So when a company decides
whom to hire, it does not seem unreasonable to include commuting
time in the candidate evaluation algorithm. After all, a company wants
to hire people who are likely to be happy and to stay around. You
don't want to spend a lot of money hiring someone who comes to
work tired after a long commute and who is likely to quit shortly after
settling in and completing their initial training.
One problem with the societal impact is that many employers are
located in expensive neighborhoods, and there may be no affordable
housing close to the employer. So, giving preference to applicants who
live nearby and have shorter commutes can bias hiring against poorer
applicants who live further away because they have no choice. One
company that found itself in this situation recognized the societal
ramifications of using an algorithm to decide whom to interview, and
issued a public statement that commute time was explicitly not taken
into account in its hiring algorithm because it believed doing so would
have an undesirable social consequence. It is this kind of responsibility
for consequences that we must consistently take, so that we can reap
the benefits of data science without doing harm that we did not intend.
7.4 SURVEILLANCE
More and more data is being collected about ourselves and this will
only keep increasing in the future. We are not new to surveillance
cameras, of course, but we have smart meters that record data on what
is happening to our electricity usage or water usage in ways that old-
style mechanical meters just could not. Smart watches and fitness
trackers like Fitbit record every activity we take part in and every step
we take. The latest versions of these devices also record our heart rate,
our breathing rate and more.
If you wear a Fitbit, you're doing it because you want the health
benefits of monitoring your exercise. The data collected also has
social benefits. For example, law enforcement can benefit from some
of these detailed records. Medical science can benefit from analyzing
a large number of electronic health records. In an extreme case, we
might decide, for political reasons, that all of this information should
be completely open, accessible to all, in a world where everyone
knows everything and there are no deceptions.
The problem with this is that deceptions actually play a useful role. We
need some deceptions. We have an explicit notion of erasing the past
for a fresh start: in many countries, records of jail sentences for many
crimes are cleared after a specified number of years, and that period
varies from country to country. We know personal relationships are
difficult and sometimes a fresh start is important. In our social
interactions alone, we tell little white lies all the time. We would have
to adjust dramatically if we were to live in a world where those little
white lies were not possible and we had the glare of total honesty.
There is one other problem to remember: quantification loses
information. The reality behind the data is often rich and nuanced, but
once it is reduced to a number, that number may not reflect everything
that was in the underlying data. We should be careful not to rely too
heavily on such quantified numbers, which are often the record we
keep. So these are some reasons why deception matters. There are also
more obvious and serious ones: if you live in a country with an
authoritarian government, small deceptions may even be necessary to
lead a normal life. For all of these reasons, in creating this valuable
and enormous record of data, we need to figure out how to control
who has access to what, and those are difficult decisions to make. It is
especially difficult because of the anonymity problem: analyzing
anonymized records can bring a lot of value, but anonymization can in
most cases be undone with sufficient effort. Hence, we need to take
care with sharing of any kind, as seemingly unrelated patterns can
allow identification. As an example, Netflix released de-identified
records of the movies people had rented and the ratings they gave
those movies. Some of those people had also posted reviews of some
of the same films on public websites, and the correlation between the
two was sufficient to re-identify them.
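The linkage idea behind the Netflix example can be sketched in a few lines. The records below are invented, and the matching rule (same movie, same rating, dates within a few days) is just one plausible heuristic, not the method actually used in that case.

```python
# A minimal sketch of re-identification by linkage: records with the user identity
# stripped can still be matched against information people posted publicly.
# All data here is invented.

from datetime import date

anonymized = {                       # released dataset: pseudonym -> rating events
    "user_0417": [("Movie X", 5, date(2006, 3, 4)),
                  ("Movie Y", 2, date(2006, 3, 9)),
                  ("Movie Z", 4, date(2006, 4, 1))],
    "user_0892": [("Movie X", 3, date(2006, 5, 2))],
}

public_reviews = {                   # what a person posted under their real name
    "Jane Doe": [("Movie X", 5, date(2006, 3, 5)),
                 ("Movie Z", 4, date(2006, 4, 1))],
}

def matches(event, review, day_slack=3):
    movie, rating, when = event
    r_movie, r_rating, r_when = review
    return (movie == r_movie and rating == r_rating
            and abs((when - r_when).days) <= day_slack)

for name, reviews in public_reviews.items():
    for pseudonym, events in anonymized.items():
        hits = sum(any(matches(e, r) for e in events) for r in reviews)
        if hits >= 2:                # a few matching ratings is already very identifying
            print(f"{pseudonym} is probably {name}")
```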
7.5 ASYMMETRY
Look at this example: You have just parked your car. Google Maps
offers to record your current location so that you can find out where
you parked your car. You can also find out how much parking time
you have available. By sharing this data, Google Maps can offer you a
small but valuable service: you can find your car quickly and not have
to pay a fine. For you, this data is valuable only for a limited time:
knowing where you are currently parked is useful, but knowing where
you were parked last week is much less so.
However, this data is much more valuable to Google as it can be
combined with the data of everyone else using the same function.
When this data is aggregated, Google can learn:
● The location of parking lots around the world
● Which parking spaces are the most popular
● Whether those parking spaces are metered or otherwise
time-limited
● Which parking spaces are likely to be available in the next
few hours
● Combined with other geographic data, the places people
normally park when visiting particular stores or venues
This value only arises when many data points are combined. And
that data stays valuable much longer. Just by accessing your personal
data point, a simple parking reminder service can be offered by
Google. However, with access to the aggregated data points, they can
extract further value. For example, by:
● Enhancing maps, using the data points to add parking
spaces or to validate those that are already known
● Suggesting a parking space when you plan a trip to the
same city
● Creating an analytics product that shows where and when
people park in a city
The term data asymmetry refers to any case where access to data is
unequal. In practice, it means that the data steward can unlock more
value than any individual contributor.
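A small sketch of this asymmetry follows: the same parking events that give one user a simple reminder give the data steward aggregate patterns. The data, field names, and locations are made up for illustration.

```python
# The individual contributor gets a reminder; the steward of everyone's combined
# events gets popularity and duration patterns. All records are invented.

from collections import Counter, defaultdict

events = [  # (user, parking_spot, minutes_parked)
    ("u1", "Market St Lot", 42), ("u2", "Market St Lot", 55),
    ("u3", "Station Garage", 140), ("u4", "Market St Lot", 38),
    ("u5", "Station Garage", 125), ("u1", "Station Garage", 130),
]

# What the individual contributor gets back: their last known spot.
def last_spot(user):
    return next((spot for u, spot, _ in reversed(events) if u == user), None)

print(last_spot("u1"))                       # 'Station Garage'

# What the data steward can compute from everyone's events combined.
popularity = Counter(spot for _, spot, _ in events)
durations = defaultdict(list)
for _, spot, minutes in events:
    durations[spot].append(minutes)

print(popularity.most_common(1))             # busiest parking location
print({s: sum(m) / len(m) for s, m in durations.items()})  # typical stay per location
```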
When does data asymmetry occur?
In general, data asymmetry occurs in almost every single digital
service or application. Anyone who runs an application automatically
has access to more information than their users. In almost every case,
user data such as content or transaction history is collected and stored.
The idea is that there is a power imbalance in a transaction when one
party has access to more information than the other. Markets can
freeze, or at least large price premiums get attached to the uncertainty.
Bringing more information to all parties enables more efficient
markets.
Data asymmetry and the resulting imbalances of power and value are
most often discussed in the context of personal data. Social networks,
for example, mine user information in order to target advertisements.
Beyond social networks, other examples of data asymmetry around
personal data include:
● Smart meters that give you a personal view of your energy
consumption, while the energy company gets an aggregated
view of the consumption patterns of all consumers
● Health devices that track and report on your fitness and
nutrition, while the provider builds aggregate views of the
health of the whole user population
● Activity loggers like Google Fit, which you can use to
record your individual journeys, while the provider builds an
understanding of mobility and transport network use across
a larger population
But because asymmetry is so common, it occurs in many other areas.
It is not a specific problem for personal data.
There are many ways to reduce data asymmetry. In general, the
solutions involve either reducing the differences in access to data or
reducing the differences in the ability to extract value from that data.
Data protection legislation plays a role in reducing the amount of data
available to an application or service provider. The Data Protection
Act, for example, restricts what types of personal data companies are
allowed to collect, store and pass on. Other examples of reducing the
differences in access to data include:
● Letting users opt out of providing certain information
● Allowing users to remove their data from a service
● Establishing data retention policies to reduce the
accumulation of data
Reducing differences in the ability to extract value from data can
include:
● Giving users more insight, and indicating when and where their
data will be used or shared
● Granting users or companies access to all of their data, such as
a full transaction history or set of usage statistics, so they
can try to extract additional value from it themselves
● Publishing some or all of the aggregated data as open data
With the creation of open data - data that can be used by anyone and
the use of which is not restricted - changes are underway in the world
of data science. As more people have access to more information at
less (or no) cost, the asymmetry of information begins to break down.
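One hedged way to act on the last option in the list above, publishing aggregated data as open data, is to release only counts and to suppress small cells before release. The threshold below is a common rule of thumb rather than a guarantee of anonymity, and the records are invented.

```python
# Publishing aggregated counts as open data, while suppressing small cells so
# individuals are harder to single out. The threshold of 5 is only a rule of thumb.

from collections import Counter

raw_records = [  # (neighbourhood, service_used) for individual users
    ("North", "smart_meter"), ("North", "smart_meter"), ("North", "smart_meter"),
    ("North", "smart_meter"), ("North", "smart_meter"), ("North", "fitness_app"),
    ("South", "smart_meter"), ("South", "fitness_app"),
]

def publishable_counts(records, min_count=5):
    counts = Counter(records)
    released = {}
    for key, n in counts.items():
        # Cells below the threshold are reported as a range instead of an exact count.
        released[key] = n if n >= min_count else f"fewer than {min_count}"
    return released

print(publishable_counts(raw_records))
# {('North', 'smart_meter'): 5, ('North', 'fitness_app'): 'fewer than 5', ...}
```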
Today's digital platforms such as apps, cell phones, websites and even
social networks enable very fast communication of information in a
format that is relevant to any party's decision. Oftentimes the data is
freely available, which is good for the users of the data.
For example, in the case of Ola, drivers and passengers alike have
access to the same information about the location and availability of
drivers and the level of demand. This information shapes the economic
relationship between driver and passenger, who may (or may not)
choose to transact based on the same set of data. Airbnb does
something similar with housing.
When the information asymmetry breaks down, the markets work
more efficiently. That does not even mean prices have to fall. In times
of big data analytics, data-driven decisions and broad accessibility of
information, symmetry of information is increasingly in demand and
helps us coordinate supply and demand.
7.6 OTHER IMPACTS
1. Unequal access: Unequal access to data and technology and
information asymmetry lead to unequal opportunities
● Not everyone or every organization is in the same starting
position when it comes to big data
● The digital divide refers, for example, to inequalities
between those who have computers and online access and those
who do not
● Even access to contact details, a privacy policy or information
on data collection, processing and sharing depends on what a
service chooses to provide
2. Discrimination: Unjust treatment of individuals and
organizations based on certain characteristics that lead to
immediate disadvantage and unequal opportunities
● People or groups are treated differently depending on
certain characteristics such as age, disability, ethnicity or gender
● Big data technologies make it possible to infer initially
unknown characteristics of others in the same or different data
sets
● Discrimination against individuals or groups can make
good business sense and is difficult to spot
● Data or algorithms that discriminate against people can
be incorrect or unreliable
3. Intrusion: The intrusion into people's privacy and business
practices, resulting in a restriction on freedom
● Big data has become integrated into almost all areas of
people's online usage and, to some extent, into their offline
experience as well
● Data is stored over long periods of time, and the potential
to analyze the data or integrate it with other data keeps growing
● Blanket suspicion by authorities and an insatiable appetite
of organizations for more and more data erode human freedom
● The way people live, work, and interact is influenced by
unsolicited Big Data applications
4. Opacity: The lack of transparency of organizational algorithms
and business practices leads to a loss of control
● Algorithms are often black boxes: not only opaque but also
mostly unregulated, and therefore perceived as beyond challenge
● Individuals and organizations cannot be sure who
collects, processes or shares which data
● There are limited ways to verify that an organization has
taken appropriate measures to protect sensitive data
● Enforcement is often limited by a lack of
government resources
● There is little practical experience with audits, including data
protection impact assessments
5. Abuse: The potential for misuse of data and technology leads to
loss of control and deep distrust
● Data and big data technologies can be used for illegal
purposes or for purposes that fall within a legal gray area
● It is difficult to check the validity of data analysis results
when they appear plausible
● Data or algorithms can be manipulated to achieve the
desired results
● The line between data use and abuse is sometimes blurred
Check your Progress
Fill in the Blanks
1. Algorithms need things to be __________________.
2. Self-selection strengthens the opinions of others with _______
views.
3. ___________________ in most cases is undone with sufficient
effort.
4. When the ___________________ ______________ breaks
down, the markets work more efficiently.
5. Unequal access to data and technology and information
asymmetry lead to __________ ___________________
True or False
1. The smaller the data sets, the greater the likelihood of randomly
discovering coincidental patterns and spurious
correlations.
2. Algorithms which are completely neutral in terms of value,
have learnt to respond to the employer's prejudices, if there
were any.
3. With access to the aggregated data sets, value can be extracted
further.
4. The intrusion into people's privacy and business practices does
not result in a restriction on freedom
5. A leading algorithm-based employment agency determines,
based on data analysis, that candidates with straight hair make
more reliable employees than candidates with curly hair. They
do not tell prospective candidates that they are using this as
their criteria. This is unethical.
Multiple Choice Questions
1. Continuous monitoring of multiple aspects of life can greatly
help deliver better medical care to people suffering from
Parkinson’s disease. While such patients can benefit greatly
from personal monitoring, and from research work done on a
cohort of patients like them, there are obvious privacy concerns
given that their every move is being recorded. The only way to
manage this trade-off is to:
a. Limit enrollment to patients who are willing to give up
privacy for the benefits they hope to get
b. Not pursue this kind of extensive monitoring
c. Find ways to limit access to monitored data so only
serious researchers have access, even if this means
cutting out some possible research directions proposed by
researchers outside the mainstream.
2. A travel web site has empirically determined that Mac users are
more willing to pay for higher-priced hotel rooms. Therefore,
the web site modifies the default order in which hotels are
shown, with higher priced hotels ranking slightly higher for
Mac users than for other PC users. Is this reordering
“discrimination” ethical?
a. Yes
b. No
c. Cannot say
3. You failed to pay up on a credit card, and this adverse entry is
now in your credit report. In an attempt to fix your score, you
claim that this entry is in error. What does the credit rating
agency do?
a. Fix your entry based on your complaint.
b. Consider your complaint and seek input from the credit
card company.
c. Ignore your complaint.
Activity
Your city has decided to make property tax payment details semi-
public: all you need to do is enter your property identifier to get this
information. Your neighbor has a small business you invested in.
After entering his property ID into the system, you find out he missed
the last two quarters after many years of paying tax on time. You
suspect a cash flow crisis in his business and are demanding your loan
back. Your neighbor is forced to sell some business goods in order to
repay you. Another investor sees this sale of business assets and also
decides to liquidate their investment. In this way, problems snowball
until your neighbor is driven out of business. Whose fault, if anyone's,
is it? Identify the specific steps that violated an ethical rule.
Summary
Even if data analysis is done in a fair way - the analysis is valid, the
data is valid, there are no mistakes, there is no intention to
discriminate, and there are no privacy issues - there can still be an
impact on society in ways that were not expected.
Distributional Unfairness
Our predictions are probabilistic, so we know they will sometimes be
wrong. We need to understand the social cost of the mistakes we make
and realize that this cost usually differs between Type 1 and Type 2
errors. Type 1 errors are those where we mistakenly classify someone
as dangerous, criminal or a bad prospect, as opposed to Type 2 errors,
where we mistakenly classify someone as safe, not criminal or a
perfectly good employee. Given this asymmetry in the cost of errors,
we need to incorporate it into the algorithm itself and adjust it to
minimize the societal cost. Search engines, for example, deliberately
choose to be broad rather than precise; the same reasoning applies to
any classification algorithm. The hard part is choosing the weights
placed on the two types of error, because the algorithm needs these
things to be quantified. Even though quantification is difficult and data
scientists will rarely know exactly what the weights should be, one can
still do much better than simply assuming they are equal.
Ossification
In data science, analytics can make it harder for people to break out of
stereotypes because those patterns get burned into the algorithms. One
consequence is that the recommendations of such algorithms come to
reflect the existing prejudices of the client they were tuned for. When
the client's people or policies later change, the algorithm adds at least
a significant delay and makes it more difficult for the employer to
remove previous discrimination, because it acts as a force to maintain
the status quo. At the same time, we know that people tend to have
similar people in their networks and are more comfortable with similar
people, so relying on purely human judgment in hiring makes it far
more likely that our hires will not be diverse. Algorithms can be
written to explicitly overcome these distortions; one just has to make
sure this is written into the algorithm design. One company that found
itself in such a situation recognized the societal ramifications of using
an algorithm to decide whom to interview, and issued a public
statement that commute time was explicitly not taken into account in
its hiring algorithm because it believed doing so would have an
undesirable social consequence. It is this kind of responsibility for
consequences that we must consistently take, so that we can reap the
benefits of data science without doing harm that we did not intend.
Surveillance
More and more data is being collected about ourselves and this will
only keep increasing in the future. We are not new to surveillance
cameras, of course, but we have smart meters that record data on what
is happening to our electricity usage or water usage in ways that old-
style mechanical meters just did not. In an extreme case, we might
decide, for political reasons, that all of this information should be
completely open, accessible to all, in a world where everyone knows
everything and there are no deceptions. The problem with this is that
deceptions can actually play a useful role. We have an explicit notion
of erasing the past for a fresh start. We know personal relationships are
difficult and sometimes a fresh start is important. We would have to
adjust dramatically if we were to live in a world where those little
white lies were not possible and we had the glare of total honesty.
Moreover, once rich data is reduced to a number, that number may not
reflect everything that was in the underlying data, so we should be
careful not to rely too heavily on such quantified numbers, which are
often the record we keep. For all of these reasons, in creating this
valuable and enormous record of data, we need to figure out how to
control who has access to what, and those are difficult decisions to
make.
Asymmetry
The term data asymmetry refers to any case where access to data is
different. The idea is that there is a power imbalance in a transaction
when one party has access to more information than the other. Data
asymmetry and the resulting imbalances of power and value are most
often addressed in the context of personal data. In general, the
solutions involve either reducing the differences in access to data or
reducing the differences in the ability to extract value from that data.
Data protection legislation plays a role in reducing the amount of data
available to an application or service provider. As more people have
access to more information at less cost, the asymmetry of information
begins to break down. For example, in the case of Ola, drivers and
passengers alike have access to the same information about the
location and availability of drivers and the level of demand. In times
of big data analytics, data-driven decisions and broad accessibility of
information, symmetry of information is increasingly in demand and
helps us coordinate supply and demand.
Other concerns that impact society:
● Unequal access: Unequal access to data and technology and
information asymmetry lead to unequal opportunities
● Discrimination: Unjust treatment of individuals and
organizations based on certain characteristics that lead to
immediate disadvantage and unequal opportunities
● Intrusion: The intrusion into people's privacy and business
practices, resulting in a restriction on freedom
● Opacity: The lack of transparency of organizational algorithms
and business practices leads to a loss of control
● Abuse: The potential for misuse of data and technology leads to
loss of control and deep distrust
Keywords
Ossification: A tendency toward or state of being molded into a
rigid, conventional, sterile, or unimaginative condition.
Surveillance: Surveillance is a targeted form of monitoring,
usually conducted to obtain specific data or evidence and
usually occurs without the person knowing they are being
watched such as data being collected from apps without a
person's knowledge.
Asymmetry: The term data asymmetry refers to any occasion
when there is a disparity in access to data.
Self Assessment Questions
1. When does data asymmetry occur?
2. What is data surveillance?
3. What is data ossification?
Answers to Check your Progress
Fill in the blanks
1. Algorithms need things to be quantified.
2. Self-selection strengthens the opinions of others with similar
views.
3. Anonymization in most cases is undone with sufficient effort.
4. When the information asymmetry breaks down, the markets
work more efficiently.
5. Unequal access to data and technology and information
asymmetry lead to unequal opportunities
True or False
1. False
2. True
3. True
4. False
5. True
Multiple Choice Questions
1. c
2. a
3. b
References
1. https://www.researchgate.net/publication/321947050_Societal_Implications_of_Big_Data
2. https://www.wired.co.uk/article/china-social-credit-system-explained
Code of Ethics
UNIT
8
Structure:
8.1 Introduction
8.2 What’s next?
8.3 Challenges and Principles
8.4 Code of Ethics
8.5 Areas of Focus for an Ethical Analyst
8.6 Guidelines for Data Analysts
8.7 Conclusion
Summary
Keywords
Self Assessment Questions
Answers to Check your Progress
References
Objective:
After going through this unit, you will be able to
● Understand the concept of a code of ethics
● Understand the guidelines that data analysts should follow for
the ethical practice of Data Science
8.1 INTRODUCTION
There are professional rules in many professions. Hippocrates, a
Greek physician, for example, gave medicine the Hippocratic oath. It
contains a number of principles, including the very famous "First, do
no harm." Many other professions, such as law and journalism, have
oaths and codes of conduct established to govern their conduct. We
need a similar code for data scientists.
The volume and coverage of data have increased massively since
then, and with them the opportunity to do both good and bad. In the
bucket of good, we find incredible insights, such as using data to
develop bespoke medical treatments. Crisis Text Line has saved lives
literally every day through a volunteer network of counselors,
equipped with powerful data and technology, who help people in
crisis. And through the Data-Driven Justice initiative, we have seen
local counties move people in need of mental health care and drug
treatment out of overcrowded prisons and into appropriate facilities
through the safe sharing of data. Not only do these solutions save
money, they are a proven success.
But we also have to deal with cases where data does more harm than
good. As ProPublica has shown, algorithms are used in courtrooms to
make decisions that are biased against particular racial groups. We
know that the data used in predictive policing can reinforce existing
patterns of enforcement. Let's not forget that people are stealing our
data: from healthcare breaches to data brokers, we have systems that
store our most sensitive data with minimal control and protection.
Finally, our democratic systems have been attacked using our own
data to incite hatred and sow discord.
With great power comes great responsibility. Now is the time to take
the lead in deciding what is right and wrong with data. Similar to how
the Hippocratic Oath defines "Do No Harm" for the medical
profession, the data science community must have a set of principles
in order to lead, and to hold each other accountable as data science
experts: understanding the difference between helpful and harmful,
pushing each other to put responsible behavior into practice, and
empowering people rather than disenfranchising them. Data is an
incredible lever for change, but AI, computer vision and other machine
technologies have a dark side when such tangible and accessible tools
fall into the wrong hands. So far, the direction of this technology has
rested mostly in the hands of the technical people who understand it.
There is no single voice that determines this choice. It must be a
community effort. Data science is a team game and we have to decide
what kind of team we want to make. We need to make sure that the
change that is coming is what we all want to see.
8.2 WHAT’S NEXT?
There is a fine line between regulation and stifling innovation, and
walking it requires a hard look at the actual accountability of the work
done by data scientists. Any instrument can be used for good or bad,
but one clear line in the sand is the belief that the tools you build
should have beneficial uses that far outweigh the bad.
Other professions more closely associated with the data revolution
have newer codes. The recently established Data Science Association
offers a relatively detailed code of ethics, specifically describing how
members should closely adhere to scientifically based statistical
methods. Some data science sub-disciplines have also established
valuable codes of ethics and other types of ethical guidelines for their
members. The Association of Internet Researchers (AoIR) developed
a code of ethics in 2002 that was updated in 2012 and addresses the
obligations of social science researchers working at the macro level in
digital areas.
Let's look at the options. First, regulation alone is not the answer.
Technology evolves rapidly; regulation moves slowly. If we rely on
regulation, we will regulate yesterday's technology, and because we
regulate yesterday's technology, we will allow many abusers to stay
technically compliant with outdated regulations. There may also be
benefits from new technology that are blocked because they conflict
with outdated or unnecessarily cumbersome regulations. So, in
general, it is better for regulation to follow ethics, so that the law
codifies things that are already a matter of social consensus.
As an example of slow regulation, consider a sign in an amphitheater:
"The unauthorized use of tape recorders and cameras is not permitted."
What they presumably mean is that unauthorized recording of any
kind is not allowed, but they said tape recorders and cameras. So if I
record a concert on my phone, is that okay? My cell phone is not a
tape recorder, and under the letter of the rule that is probably fine. This
is the problem with precisely worded rules in a world where
technology changes rapidly.
The downside of regulation is compliance. Businesses and individuals
must comply with the law, and most large corporations have units that
are responsible for ensuring that companies comply with applicable
laws. However, this is not a forward-looking approach. When you
think in terms of compliance, you tend to think about the minimum
required to satisfy the law, and the law, as we said earlier, follows
technology and adapts to it slowly. So if you are trying to do the bare
minimum to meet the letter of the law, you are not looking to the
future; you are fighting yesterday's battles. The right thing for
companies, and what will actually put them in a better position in the
future, is to think about what the intent of the law was or is, that is, the
social consensus, and what the ethical position is, and then do the right
thing. Companies know that if they annoy customers, they lose, so
they are motivated to regulate themselves. And a number of trade
associations have been formed to help with these types of questions.
For example, there are some that focus on advertising-related topics,
especially on the internet. Each of these associations tries to formulate
principles: rules governing how "us", the companies, deal with
"them", the customers. What companies want from these trade
associations are actionable rules. Such rules can lag less than legal
requirements, but they are still rules that usually follow rather than
lead. There are some forward-thinking corporate attorneys
certainly not the case that companies fail to think about what the
future holds in terms of technology and how to be responsible citizens.
But as data scientists we should own our own destiny; we should not
have corporate lawyers defining it for us. As a data scientist, you
should be passionate about the good things data science can do, and
want to be ethical, so that you can continue to be proud of what you
do and so that we can continue to thrive as a profession that society
values for the benefits our work brings. There are two simple but
overarching principles.
1. Do not surprise.
Don't surprise the subjects of the data you have recorded, collected,
used, and analyzed. You may get surprising results from your
analysis; that is fine, because those results surprise the person who
asked for the analysis. The point is that you do not want the data
subject to be taken by surprise because they did not expect you to
collect or use their data in the specific way you chose. Arguing that
the data subject signed a multi-page, fine-print informed consent form,
and therefore cannot claim to be surprised, misses the point: not
surprising means there is no surprise relative to how most people act,
behave, and think.
2. Own the outcomes.
As data scientists, we unleash a technical process. That process has
social implications, so it is not enough to say there was nothing wrong
with the technical process: "there is no bug in our code, we take the
data we get and report whatever results the algorithm produces, there
is nothing wrong with the algorithm." That is not enough. We need to
understand the results, and we have to own them. If the process
produces undesirable results, we need to figure out how to fix the
process.
Other professional organizations have their own codes of conduct, and
many codes for data science have been suggested by others. The
coverage of these codes varies. A code with many points and lots of
detail can be a very specific action plan, but it is not memorable.
These two principles, in contrast, are short enough to remember and
should make us think further about the kinds of issues discussed in
this course. "Do not surprise" covers things like who owns the data
and what the data can be used for; "own the outcomes" covers things
like what is valid, what is fair, and what the social consequences are.
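One concrete habit that follows from "own the outcomes" is to audit a model's error rates by group before shipping it, rather than stopping at overall accuracy. The sketch below uses invented predictions and group labels and is only one simple form such an audit could take.

```python
# A minimal sketch of "owning the outcomes": compare error rates across groups
# instead of stopping at overall accuracy. The rows below are invented.

from collections import defaultdict

rows = [  # (group, true_label, predicted_label)
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 1),
    ("B", 1, 0), ("B", 0, 0), ("B", 1, 0), ("B", 0, 0),
]

stats = defaultdict(lambda: {"fp": 0, "fn": 0, "pos": 0, "neg": 0})
for group, truth, pred in rows:
    s = stats[group]
    if truth == 1:
        s["pos"] += 1
        if pred == 0:
            s["fn"] += 1          # false negative: a true positive was missed
    else:
        s["neg"] += 1
        if pred == 1:
            s["fp"] += 1          # false positive: a true negative was flagged

for group, s in sorted(stats.items()):
    print(group,
          "false positive rate:", s["fp"] / s["neg"],
          "false negative rate:", s["fn"] / s["pos"])
# Here group B's false negative rate is far higher -- an outcome the team must
# own and fix, even though "there is nothing wrong with the code".
```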
8.3 CHALLENGES AND PRINCIPLES
Challenges for a universal code of data ethics
One of the unique features of today's datasets is their wide, multi-
disciplinary utility - data science is closer to a general-purpose service
than to a single discipline, because it is useful in many industries and
fields. Analytical tools developed in applied mathematics, statistics
and computer science are being taken up in areas such as medicine,
marketing, finance, the humanities, social sciences, criminal justice,
geography and geospatial imaging, manufacturing, social work, human
rights and much more. This is a major challenge for a universal code
of data ethics: the specific uses of data science that a single code
would have to bring together may have very little in common.
The principles of data ethics appropriate in medicine may not carry
over to finance, because there is a significant difference between the
social role of the medical professional and that of the financier.
Professional sub-communities in other fields have solved such
problems by creating domain-specific secondary ethics codes. If data
science is on the path to ubiquity, it will be challenging to define a
universal code that covers its use in such a variety of contexts. Some
data science sub-disciplines already produce valuable ethics codes and
other types of ethics guidance for their members.
Principles for Data Ethics
Data science professionals and practitioners should strive to uphold
these principles:
1. The highest priority is to respect the people behind the data.
When the insights gained from the data can affect the human
condition, the potential harm to the individual and the
community should be considered in the most appropriate way.
Big data can create compelling insights about the population, but
those same insights can be used to unjustly limit an individual’s
possibilities.
2. Attend to the downstream uses of the dataset. Data professionals
should try to use data in a way that is consistent with the intent
and understanding of the party that disclosed it. Many rules
govern datasets based on the status of the data, such as "public,"
"private," or "proprietary." However, what is done with a dataset
is ultimately more consequential for the subject or user than the
type of data or the context in which it was collected. The
repurposing and correlation of research and industry data
represents both the great promise and the greatest risk posed by
data analysis.
3. Try to match privacy and security practice to privacy and
security expectations. Data subjects have a wide range of
expectations regarding the privacy and security of their data, and
those expectations are often context-dependent. Designers and
data professionals should take those expectations seriously and
align practice with them as far as possible.
4. Data and analytical tools shape the outcomes of their use. There
is no such thing as truly raw data: every dataset and every
accompanying analytics tool carries a history of human decision
making. As far as possible, that history should be auditable,
including the context of collection, the methods of consent, the
chain of responsibility, and the assessment of the quality and
accuracy of the data.
5. Data can be a means of inclusion and exclusion. Everyone is
entitled to the social and economic benefits of data, but not
everyone suffers equally from the process of data collection,
correlation, and prediction. Data professionals should strive to
minimize the adverse effects of their products and listen to the
concerns of the affected communities.
6. Explain methods of analysis and use of the data as openly as
possible. Increasing transparency at the point of data collection
can reduce risk as data moves along the data supply chain.
7. Data scientists and practitioners should accurately represent
their qualifications and the limits of their expertise, adhere to
professional standards, and strive for peer accountability. The
long-term success of this field depends on the trust of the public
and of consumers. Data professionals need to develop practices
for holding themselves and their peers accountable to shared
standards.
8. Strive for transparency, connectivity, accountability and
auditability. Not all ethical dilemmas have design solutions, but
being aware of design choices breaks down many practical
barriers that stand in the way of shared, strong ethics. Data
ethics is an engineering challenge that deserves the field's best
thinking.
9. Products and research practices should be subject to internal,
and potentially external, ethical review. Organizations should
prioritize establishing consistent, efficient and proactive review
processes for new products, services and research programs.
Internal peer review can reduce risk, and external review boards
can contribute significantly to public trust.
10. Governance systems should be robust, known to all team
members and reviewed regularly. Data codes of conduct pose
organizational challenges that cannot be solved by familiar
compliance mechanisms alone. As the regulatory, social and
technical landscapes keep shifting, organizations involved in
data analysis need collaborative, regular and transparent practices
for ethical governance.
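Following points 4 and 8 above, one simple, hedged way to make a dataset's history auditable is to keep a provenance record alongside the data itself. The field names below are illustrative, not a standard schema, and the values are invented.

```python
# Keeping an auditable provenance record next to the dataset it describes.
# The schema and values are illustrative assumptions, not a standard.

import json
from datetime import date

provenance = {
    "dataset": "customer_support_tickets_2021",
    "collected_by": "web form on the support portal",
    "collection_period": [str(date(2021, 1, 1)), str(date(2021, 6, 30))],
    "consent_method": "checkbox consent, linked privacy notice v3",
    "known_quality_issues": ["free-text fields contain occasional PII",
                             "country field missing for ~4% of rows"],
    "responsible_owner": "analytics-team@example.org",
    "approved_uses": ["service quality reporting", "staffing forecasts"],
    "review_log": [{"date": "2021-07-05", "reviewer": "internal ethics board",
                    "decision": "approved for aggregate reporting only"}],
}

with open("customer_support_tickets_2021.provenance.json", "w") as fh:
    json.dump(provenance, fh, indent=2)   # ship this file together with the dataset
```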
8.4 CODE OF ETHICS
Most people might think about using such a code to define
accountability. However, the real concept behind a code of ethics is
not accountability per se, but the idea that the group can collectively
agree on a set of basic principles.
These principles guide actions on a systemic level to ensure that an
individual's moral compass points in the right direction when his or
her own values or beliefs are challenged. Some groups are currently
trying to define these, such as the Data Science Association, Alan
Fritzler's Data Science for Social Good, and the Oxford-Munich
Code of Conduct, to name a few.
While these are great, detailed attempts to cover a wide variety of
roles and work being done by a data scientist, we may need a smaller,
more focused set of values to agree on:
Non-maleficence
Data scientists should work towards the good of humanity in a way
that does not intentionally cause harm. While the use of machine
learning for autonomous vehicle driving is for the common good, the
conversation becomes more complicated when it is applied to military
vehicles. Data scientists need to think through the implications of their
actions to better understand whether the final creation will do more
good than harm.
Statutory
It goes without saying that data scientists should not only comply with
the laws of their own country, but also comply with internationally
agreed regulations. Data scientists should be directly aware of the
legal implications of their creations. Releasing something, or pushing
responsibility for it onto someone else, under the pretext that it was
inevitable anyway, is unacceptable.
The Greater Good
Not every creation is simply good or bad. Perhaps one of the most
challenging aspects of being a data scientist is that much of the work
is research, and it is not always possible to control how such research
is later used. But as a group, we should aim for research that serves
the common good.
Notability
The idea of hiding behind anonymity for fear of the possible outcome
should not be a way of protecting yourself. The names of the data
scientists are tied to the research and, in turn, are directly related to the
implementation. Standing behind your own work and its derivations
can ensure that the creations it inspires carry a certain level of
responsibility.
The values suggested above are partly a reaction to something
alarming that may only get worse as the deepfake space continues to
grow, built on good research by well-intentioned data scientists. There
is an even bigger conversation to be had about collecting and using the
data that goes into machine
learning. What about the privacy of people directly or indirectly
related to the outcome of the work? How can laws be created to
protect data scientists and those affected? It's too big of a topic, but it's
a conversation we need to start now.
8.5 AREAS OF FOCUS FOR AN ETHICAL ANALYST
These code of ethics guidelines for ethical analysts are
straightforward. The three key areas every data analyst should focus
on are:
1. Test and learn: As an organization, the opportunity to
implement testing and learning gives you the opportunity to
experiment and iterate. This also gives you the freedom to make
mistakes and improve. This reduces possible errors down
the road.
2. Impact: Your role as a data analyst is to improve business
results. You must act as a source of truth for your organization.
Direct people to data-based decisions and away from these “gut
feelings”.
3. Accountability and Transparency: It's easy to get overwhelmed
by strategy and maturity models, changes to the search
algorithm, tool upgrades and reports on Monday morning. Be
accountable and transparent about your data, whether you are an
analyst on the first day of the job or an analytics old hand.
Ownership of your data is what is at stake in the data analytics industry.
Many industries have professional standards of conduct that help
members of their respective communities by providing guidelines to
strive for and guidelines for dealing with ethically challenging
situations. In the digital analytics industry in particular, the Digital
Analytics Association (DAA) has published the code of ethics for web
analysts. These ethical guidelines contribute to open, trusting and
cooperative relationships with customers.
The Analyst Code of Ethics focuses on three topics:
1. Privacy
a. Protect your customers' data.
b. Do not collect data just to collect data.
c. Stay informed about data protection laws.
2. Integrity
a. Strictly enforce data quality.
b. Do not misinterpret, fabricate or embellish data.
c. Thoroughly explain and document your data.
3. Honesty
a. Tell the truth even if it's bad news.
b. Be open and accountable when mistakes are made.
c. Follow the golden rule.
1. Privacy:
Customers provide you with personally identifiable and highly
sensitive information. Pledge to protect all sensitive information that
is provided about the customer. Also, promise to let them know of any
personally identifiable information that may be accidentally collected
that violates a product's terms of use. Educate your customers about
consumer data collection practices and laws and advise them to keep
customer data safe, private and secure.
While conventional wisdom suggests that more data is better, know that
collecting unnecessary data is both risky and rarely provides useful
insights. By focusing on collecting critical data points, you will get
higher quality data and reduce the chance of highly sensitive data
being inadvertently collected. Let your clients know what data they
should and shouldn't be collecting from your customers, and avoid the
“keep it all” mentality.
Consumer privacy laws are changing rapidly, and penalties for
violations are severe. With websites and mobile apps accessible from
anywhere in the world, understanding which laws apply to which
users and which websites / apps becomes a challenge. Keep up to date
with consumer privacy laws to better work with your customers and
their legal teams and to comply with regulations.
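A rough sketch of the data-minimization advice above follows: scan incoming free-text fields for obvious personal identifiers before they land in the analytics store. The regular expressions are deliberately simple illustrations and will not catch every form of PII; a real pipeline would need broader patterns and review.

```python
# Scanning free-text input for obvious PII before storing it for analytics.
# The patterns are simplistic illustrations, not a complete PII detector.

import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\b(?:\+?\d[\s-]?){10,13}\b"),
}

def redact(text):
    findings = []
    for label, pattern in PII_PATTERNS.items():
        if pattern.search(text):
            findings.append(label)
            text = pattern.sub(f"[{label} removed]", text)
    return text, findings

clean, found = redact("Customer jane.doe@example.com called from +91 98765 43210.")
print(clean)    # identifiers replaced before the record is stored
print(found)    # ['email', 'phone'] -- log this so the client can be informed
```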
2. Integrity:
Insights are only as good as the data behind them, so take data
quality very seriously. On the data collection side, focus on designing,
implementing and validating data to ensure that it is correct,
actionable and trustworthy. Don't blindly trust the data in the tool, but
do have a deep understanding of the basics of what drives those
numbers, how exactly to get reports and how to challenge the data
when it doesn't add up.
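A small sketch of "not trusting the numbers blindly": a few basic sanity checks an analyst might run before a figure goes into a report. The assumed schema, checks, thresholds, and sample rows are all illustrative, not a prescribed procedure.

```python
# A few illustrative sanity checks to run before reporting a number.
# The schema (order_id, revenue, date) is an assumption for this example.

def sanity_checks(rows):
    """rows: list of dicts with 'order_id', 'revenue', 'date' keys (assumed schema)."""
    problems = []

    ids = [r["order_id"] for r in rows]
    if len(ids) != len(set(ids)):
        problems.append("duplicate order ids inflate the totals")

    if any(r["revenue"] is None or r["revenue"] < 0 for r in rows):
        problems.append("missing or negative revenue values")

    if not rows:
        problems.append("empty extract -- did the tracking tag fire at all?")

    return problems

sample = [{"order_id": 1, "revenue": 120.0, "date": "2021-06-01"},
          {"order_id": 1, "revenue": 120.0, "date": "2021-06-01"},
          {"order_id": 2, "revenue": -5.0, "date": "2021-06-02"}]
print(sanity_checks(sample))
# ['duplicate order ids inflate the totals', 'missing or negative revenue values']
```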
Careless, uninformed, biased, or intentionally misleading use of data
is negligent at best and completely illegal at worst. Recognize that
decision making and analysis involve cognitive biases and strive to be
a neutral third party by interpreting data regardless of such biases,
political pressures, or a sense of self-preservation.
Without an understanding of how and where data is being collected on
your website or app, your ability to interpret the data and gain real
insights will be compromised. Provide documentation that is written
in a non-technical language and describes the data collection process
so that your customers can have a full understanding of their data sets.
3. Honesty:
The goal is to help your customers optimize and improve their
marketing efforts and online conversions. When marketing or website
performance is not improving, pass on the results even if they are not
what your customers hoped for, and even if it was your idea that did
not work.
Strongly believe that honest bad news and suggestions for
improvement are better than false good news. And if there is bad news
to share, never surprise your customers at an inopportune time or
share it with the wider customer team before you let your main
customer stakeholder know.
Do everything you can to be thorough and accurate, but sometimes
errors inevitably occur. If you make a mistake, own it and do
everything you can to correct the situation.
Never advise a customer to make a decision that you would not make
for your own business. When you work with clients, approach all
aspects of any project as if you were part of their organization and
strive to prepare well for the future.
8.6 GUIDELINES FOR DATA ANALYSTS
Data analysts encounter gray areas all the time. Below are 8 guidelines
to tackle such gray areas.
1: Protect Your Customer
Your customers often share a lot of personal information. PII
(personally identifiable information) should be strictly protected. As a
rule of thumb, ask a friend who is not in the business if they would
like this data to be published about them. If the answer is no, then
proceed accordingly.
2: Be the Bearer of Bad News
Don't be afraid of bringing bad news. It is easy to get carried away
with the desire to show continuous progress or growth even when
there is none. If the data reflects bad news, report it and explain why.
3: Don’t Torture the Data
It is unprofessional to use data in a way that is careless or
intentionally incorrect. Anyone with a little knowledge of Excel can
make a chart look better than it should. Do not create a phantom trend
or exaggerate a slight surge in the numbers. Also, remember that
owning your data is important: know the data inside and out so you
don't accidentally present incorrect information.
4: Don’t Play Favorites
As much as you'd like your analysis to always have that gold nugget
of data, sometimes it just doesn't exist. Don't headline a point that
really belongs in the appendix of your PowerPoint.
5: Don’t Lie
It seems obvious. The hallmark of every great analyst is trust. If your
numbers are wrong, or there is a typo, own it and address the error.
Trust pays off in the long term: people will remember your integrity
long after they forget that you got the wrong number on slide 12.
6: Understand the Role of Data Quality
Data quality is a big problem. Reporting bad data as true is worse than
not reporting any data. Do not blindly believe what your analysis tool
is telling you. Understand the basics of what goes into those numbers.
7: Am I Improving the Business?
Ask yourself, "Am I improving the business?" The end customer and
the value to your company must always be kept in mind. Make it a
rule, not an exception, that customers own their data.
8: Data Governance is Critical
Data governance plays a role in your work. It is necessary to bring
order and democratize the data. However, less control equates to more
ignorance. Have good documentation and take the time to train your
internal and external customers.
8.7 CONCLUSION
Finally, ethics is important. After all, companies are realizing that data
is power. The power of data can be used for nefarious purposes and
this raises concerns about ethical issues in data analysis. We are at a
time when lies spread faster than the truth, and we see controversial
and sometimes downright illegal uses of sensitive information. Some
people speak out, and some even resign from their jobs, to keep
misleading claims from persisting.
As data analysts, we face new and greater challenges every day. We
have to deal with sometimes daunting data hurdles. The GDPR is a
first step towards addressing consumer privacy concerns. While
organizations like the Digital Analytics Association have established a
code of ethics, there has been little to keep companies and analysts in
check. We believe that a code of ethics for data analysts is
fundamental, and encourage others to follow a professional standard.
It is difficult for the average analyst to influence how data protection
is handled at the corporate level. Still, there is an ethical standard that
we can adhere to when analyzing and reporting data.
Data science has great power: great power to harm and great power to
help. Data scientists cannot hide behind the claim of neutral
technology and ignore how that technology is actually being used. In
a complex world, ethical analysis is not easy. In any real-world
situation, it is hard to see exactly where the problem lies, to
understand how the boundaries are defined, and to decide whether
something is right or wrong. Finally, I would like to quote Erika
Andersen, who observed that doing the right thing does not
automatically lead to success, but compromising ethics almost always
leads to failure.
Check your Progress
Fill in the blanks
1. The downside of regulation is ____________.
2. The highest priority is to
_____________________________________.
3. Data can be a means of _______________ and
________________.
4. The opportunity to implement testing and learning gives you the
opportunity to _________________ and _______________.
5. The ______ is a first step towards addressing consumer privacy
concerns.
True or False
1. Given the economic utility basis for ethics, we should expect
even “soulless” companies to behave ethically in their own self-
interest.
2. The viability of technology is mostly in the hands of the
technical people who understood it.
3. A single code that brings together specific uses of data science
will have a lot of commonality.
4. Data scientists should only comply with the laws of their own
country.
5. Reporting bad data as true is worse than not reporting any data.
Multiple Choice Questions
1. Many companies have “Compliance and Ethics” programs. The
US Sentencing Commission, in its guidelines regarding
corporate wrongdoing prescribes: “compliance and ethics
program shall be reasonably designed, implemented, and
enforced so that the program is generally effective in preventing
and detecting criminal conduct.” In light of what you have
learned in this course, this formulation is problematic because:
a. Preventing criminal conduct is a very low ethical bar: we
should expect more
b. Compliance is about what you must do and ethics is
about what you should do
c. Both of the above
2. You work for a major cell phone service provider, and have
access to large volumes of detailed location data for your
customers. One day, you are able to correlate and determine
whether your customer is indoors or outdoors. This leads you
to a new signal amplification algorithm that is amazingly
effective in improving call quality. You surprise yourself, your
boss, and your company, with these results. This analysis does
not violate the “Do Not Surprise” rule.
a. Yes
b. No
c. Can’t Say
3. You run a for-profit public corporation and you have found a
perfectly legal way to arrange your business that will
substantially decrease the taxes you have to pay. Ethically
speaking, should you do it?
a. Yes. Your first responsibility is to the shareholders and to
maximize their profit.
b. No. Everyone must pay their fair share of taxes. Paying
less is unethical.
c. It depends.
Activity
Write a case study related to ethics in data science (preferably related
to your surroundings or that you yourself have experienced).
Summary
The volume of data and the coverage of the data have increased
massively and with it the opportunity to do both good and bad. Now is
the time to take the lead in deciding what is right and wrong with data.
Similar to how the Hippocratic Oath defines 'Do No Harm' for the
medical profession, the data science community must have a set of
principles in order to lead and hold each other as data science experts
accountable. Data science is a team game and we have to decide what
kind of team we want to make.
Any instrument can be used for good or bad, but one clear line in the
sand is the belief that the tools you build should have a greater use that
far outweighs the bad. In general, it is better for regulation to follow
ethics, so that the law codifies things that are already a matter of
social consensus. The law, as we said earlier, follows technology and
is slow to adapt to it.
There have been many codes for data science suggested by others, and
their coverage varies. A code with lots of detail can be a very specific
plan of action, but it is not easy to remember. A simple two-point code
is easier to keep in mind, six words you will remember: don't surprise
the data subject, and own the outcomes. "Do not surprise" covers
things like who owns the data and what the data can be used for;
"own the outcomes" covers things like what is valid, what is fair, and
what the social consequences are.
Challenges for a universal code of data ethics
One of the unique features of today's datasets is their wide, multi-
disciplinary utility - data science is closer to a general-purpose service
than to a single discipline because it is useful in many industries and
fields. This is a major challenge for a universal code of data ethics: the
specific uses of data science that a single code would have to bring
together may have very little in common. If data science is on the
path to ubiquity, it can be
challenging to define a universal code that covers its use in a variety
of contexts.
Principles for Data Ethics
When the insights gained from the data can affect the human
condition, the potential harm to the individual and the community
should be considered in the most appropriate way. Data professionals
should try to use data in a way that is consistent with the intent and
understanding of the party disclosing it. The repurposing and
correlation of research and industry data represents both the great
promise and the greatest risk posed by data analysis. Data subjects have a wide
range of expectations regarding the privacy and security of their data,
and those expectations are often context-based. Everyone is entitled to
the social and economic benefits of data, but not everyone suffers
equally from the process of data collection, correlation, and
prediction. Data professionals should strive to minimize the adverse
effects of their products and listen to the concerns of affected
communities. Data scientists and practitioners must accurately
demonstrate their qualifications, the limits of their skills, adherence to
professional standards, and a commitment to peer accountability.
Code of Ethics
Most people might think about using such a code to define
accountability. However, the real concept behind a code of ethics is
not accountability per se, but the idea that the group can collectively
agree on a set of basic principles.
Non-maleficence: Data scientists should work towards the good of
humanity in a way that does not intentionally cause harm.
Statutory: Data scientists should not only comply with the laws of
their own country, but also comply with internationally agreed
regulations.
The Greater Good: It is not always possible to control the conduct of
such research. But as a group, research should be for the common
good.
Notability: Standing behind your own work and its derivations can
ensure that the creations it inspires carry a certain level of
responsibility.
Areas of Focus for an Ethical Analyst
The Analyst Code of Ethics focuses on three topics:
1. Privacy
a. Protect your customers' data.
b. Do not collect data just to collect data.
c. Stay informed about data protection laws.
2. Integrity
a. Strictly enforce data quality.
b. Do not misinterpret, fabricate or embellish data.
c. Thoroughly explain and document your data.
3. Honesty
a. Tell the truth even if it's bad news.
b. Be open and accountable when mistakes are made.
c. Follow the golden rule.
Guidelines for Data Analysts
1. Protect Your Customer
2. Be the Bearer of Bad News
3. Don’t Torture the Data
4. Don’t Play Favorites
5. Don’t Lie
6. Understand the Role of Data Quality
7. Am I Improving the Business?
8. Data Governance is Critical
Conclusion
The power of data can be used for nefarious purposes and this raises
concerns about ethical issues in data analysis. We believe that a code
of ethics for data analysts is fundamental, and encourage others to
follow a professional standard. It is difficult for the average analyst to
influence how data protection is handled at the corporate level. Still,
there is an ethical standard that we can adhere to when analyzing and
reporting data. Data scientists cannot hide behind the claim of neutral
technology and ignore how this technology is actually being used. In
any real-world situation, it is hard to see exactly where the problem
lies, to understand how the boundaries are defined, and to decide
whether something is right or wrong.
Keywords
Maleficent: Causing harm or destruction especially by using
power in the wrong way.
Tangible: Perceptible by touch.
Self Assessment Questions
1. What is a code of ethics?
2. Describe any 4 guidelines for data analysts.
3. Give principles of data ethics.
Answers to Check your Progress
Fill in the Blanks
1. The downside of regulation is compliance.
2. The highest priority is to respect the people behind the data.
3. Data can be a means of inclusion and exclusion.
4. The opportunity to implement testing and learning gives you the
opportunity to experiment and iterate.
5. The GDPR is a first step towards addressing consumer privacy
concerns.
True or False
1. True
2. True
3. False
4. False
5. True
Multiple Choice Questions
1. b
2. a
3. c
References
1. https://www.datascienceassn.org/code-of-conduct.html
2. http://www.code-of-ethics.org/code-of-conduct/