Structured Data
Contemporary Challenges in
Data Policy and Governance
Lecture #05
14-08-2025 PS-632 CPS IIT Bombay 1
Data Types
• Primitive or Atomic Data Types
– Numbers
– Booleans
– Characters
– Enumerations
– Strings
– Timestamps
• Composite or Aggregate Data Types
– Arrays
– Lists
– Sets
– Dictionaries
– Records (tuples)
• Abstract Data Types
– Formal specifications of the operations on values from a specified domain
Encodings for atomic data
• Characters
– ASCII, EBCDIC, UTF-16, UTF-32
• Numbers
– IEEE 754 format (latest version IEEE 754-2019, adopted as ISO/IEC 60559:2020)
• Other atomic data types are actually composites that use the
basic encodings for each number or character
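The effect of choosing an encoding can be seen in a small sketch. It assumes Python's built-in codecs, with "cp037" standing in as one common EBCDIC code page (an illustrative choice, not one named in the lecture), and uses the standard struct module to produce the IEEE 754 double-precision layout.

```python
import struct

# Characters: the same abstract character maps to different byte sequences
# depending on the encoding chosen.
ch = "A"
ascii_bytes = ch.encode("ascii")       # 1 byte
utf16_bytes = ch.encode("utf-16-be")   # 2 bytes, big-endian, no byte-order mark
utf32_bytes = ch.encode("utf-32-be")   # 4 bytes, big-endian, no byte-order mark
ebcdic_bytes = ch.encode("cp037")      # assumption: cp037 as the EBCDIC code page

# Numbers: IEEE 754 binary64 (double precision), big-endian byte order.
double_bytes = struct.pack(">d", 1.0)

print(ascii_bytes.hex(), utf16_bytes.hex(), utf32_bytes.hex(), ebcdic_bytes.hex())
print(double_bytes.hex())
```

The same letter occupies one, two or four bytes depending on the encoding, which is why two programs must agree on the encoding before they can exchange even atomic data.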
Composite data
• The data was stored separately from its interpretation
• The programming language allowed the definition of the
composite type i.e. string, array, record, struct etc.
• Physical storage was allocated based on the programming
language specifications and so interoperability required
knowing specific physical layouts and encodings for data
• This knowledge of layout could in some sense be called
“metadata”
Describing the syntax of data
• Each atomic data item has
– A type – describes a set of possible values
– A value – one of the permissible elements of the set defined by the type
• Composite data consists of atomic data combined together in
various ways
• Describing the structure of data by applying various combining
methods to atomic data is called defining the abstract
syntax of data.
• To concretely store and process data the abstract data has to be
“encoded” into a concrete representation
An example
[Figure: a binary tree. The root node 5 has left child 7 and right child 2. Node 7 has left child 4. Node 2 has left child 3 and right child 8.]
The structure depicted is a binary tree. Each node has up to two child nodes. The number in the box can be thought of as the “value” of the node.
Each node can be uniquely described by a “selector”. A selector is of the format <path>:<value> where path is a sequence of 0s and 1s separated by dots. Here 0 refers to the left child and 1 refers to the right child of a node. All selectors begin at the root.
For example the root node is just :5
The node with value 4 is 0.0:4
The node with value 8 is 1.1:8
The node with value 3 is 1.0:3
The whole tree can be represented by the set {:5, 0:7, 1:2, 0.0:4, 1.0:3, 1.1:8}
Representing a complex structure in a linear sequence of symbols is called “serialization”.
If a structure can be unambiguously represented by more than one sequence then a standard or “canonical” representation can make processing more efficient.
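The selector scheme can be sketched in a few lines of Python. The Node class and the selectors function are hypothetical names introduced for illustration. Emitting the selectors in pre-order (node, then left subtree, then right subtree) is an assumed way of fixing one canonical ordering, not something the lecture prescribes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """A binary tree node (hypothetical helper for this example)."""
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def selectors(node: Optional[Node], path: Tuple[int, ...] = ()) -> List[str]:
    """Return the <path>:<value> selectors of a subtree in pre-order."""
    if node is None:
        return []
    here = ".".join(str(step) for step in path) + ":" + str(node.value)
    # 0 selects the left child, 1 selects the right child.
    return ([here]
            + selectors(node.left, path + (0,))
            + selectors(node.right, path + (1,)))


# The tree from the example: 5 at the root, 7 and 2 below it, then 4, 3 and 8.
tree = Node(5, Node(7, Node(4)), Node(2, Node(3), Node(8)))
serialized = selectors(tree)
print(serialized)
```

As a set, the result matches the set of selectors given above. Any other ordering of the same selectors would describe the same tree, so always emitting one fixed order lets two serializations be compared as plain text.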
Some widely used serialization standards
• ASN.1
• XML
• JSON
• Some programming languages support “object serialization”
such as pickle for Python.
• Other examples of languages providing native support for
object serialization are Java, .NET family, PHP, Ruby and
Smalltalk
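A minimal sketch of two of these options, using Python's standard json and pickle modules. The record shown is made-up sample data, and sort_keys=True is used here only to illustrate one way of getting a repeatable text form.

```python
import json
import pickle

# A small composite value: a record containing a string, a number and a list.
record = {"roll_no": 3, "name": "Jyoti", "marks": [91, 89, 93]}

# JSON: a language-independent text serialization.
json_text = json.dumps(record, sort_keys=True)
from_json = json.loads(json_text)

# pickle: Python's native object serialization (a binary format).
pickled = pickle.dumps(record)
from_pickle = pickle.loads(pickled)

print(json_text)
print(from_json == record, from_pickle == record)
```

JSON text can be read by programs written in any language, whereas a pickle is meant to be read back by Python. Unpickling data from an untrusted source is unsafe because it can execute arbitrary code.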
Abstract Data Types
• Describe the data from a user’s point of view
• Define the operations that are permissible on the actual data
items without burdening the users with the low level
implementation details
• Allow users to construct more general algorithms and express
them more compactly
An Example – The list of numbers datatype
• A list intuitively contains many elements one after the other. You
can get the first element, add an element to the end of the list,
append a list to the end of a list and delete the first element from a
list. This could be described as
• Createlist: -> list (creates an empty list)
• Head: list -> number (gets the first element)
• Additem: (list, number) -> list
• Append: (list,list) -> list
• Deletefirst: list ->list
An Example – The list of numbers datatype
newlist = Createlist()
newlist is the empty list
list1 = Additem(newlist,1)
list1 is [1]
list2 = Append(list1,[2,3,4])
list2 is [1,2,3,4]
list3 = Deletefirst(list2)
list3 is [2,3,4]
x = Head(list3)
x is 2
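One possible implementation of this abstract data type is sketched below on top of Python's built-in list. The lowercase function names mirror the operations above. Raising ValueError for Head or Deletefirst on an empty list is an assumption, since the specification does not say what should happen in that case.

```python
from typing import List

NumberList = List[float]


def createlist() -> NumberList:
    """Createlist: -> list (creates an empty list)."""
    return []


def head(lst: NumberList) -> float:
    """Head: list -> number (gets the first element)."""
    if not lst:
        raise ValueError("Head is undefined for an empty list")  # assumption
    return lst[0]


def additem(lst: NumberList, item: float) -> NumberList:
    """Additem: (list, number) -> list (new list with item at the end)."""
    return lst + [item]


def append(first: NumberList, second: NumberList) -> NumberList:
    """Append: (list, list) -> list (second list joined to the end of the first)."""
    return first + second


def deletefirst(lst: NumberList) -> NumberList:
    """Deletefirst: list -> list (new list without the first element)."""
    if not lst:
        raise ValueError("Deletefirst is undefined for an empty list")  # assumption
    return lst[1:]


# The same sequence of operations as in the example above.
newlist = createlist()
list1 = additem(newlist, 1)
list2 = append(list1, [2, 3, 4])
list3 = deletefirst(list2)
x = head(list3)
print(list1, list2, list3, x)
```

Each operation returns a new list instead of modifying its argument, which matches the functional signatures above. A user of these five functions never needs to know that a Python list sits underneath.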
Data persistence
• Most applications require that data acquired from human input,
sensors or processing be stored so that it can later be used for
further processing or queries.
• This property of the data being stored for later retrieval is called
persistence.
• Data needs to be persisted to a storage medium like a magnetic
tape (historical), magnetic disk drive or solid state disk drive
• A format needs to be decided to store data for later retrieval.
• The simplest and most commonly used method is through the use
of files.
Files
• A file is conceptually a “list” of “records” that is given a unique
name and is stored on a tape or disk.
• Each record consists of a number of “fields”
• Fields may contain data of different types like strings, integers
or floating point numbers
• The main operations that can be performed on files are
creating a file, adding a record to a file, fetching a record from
a file, deleting a record from a file and deleting a file
Example of file usage
Filename: CPS703 Marksheet.txt

Roll No  Name    Mid Sem-1  Mid Sem-2  End Sem
1        Ajay    75         84         89
2        Sunil   83         76         90
3        Jyoti   91         89         93
4        Thomas  78         91         85
5        Renu    86         72         81
6        Sunita  72         87         89
7        Aijaz   93         85         87

Each row is a record and each column is a field.
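The file operations listed earlier (create a file, add a record, fetch a record, delete a record, delete the file) can be sketched with Python's standard csv module. Storing the marksheet as comma-separated text in a temporary directory is an assumption made for the example, and the helper function names are hypothetical.

```python
import csv
import os
import tempfile

FIELDS = ["Roll No", "Name", "Mid Sem-1", "Mid Sem-2", "End Sem"]


def create_file(path):
    """Create a file that contains only the header row of field names."""
    with open(path, "w", newline="") as f:
        csv.writer(f).writerow(FIELDS)


def add_record(path, record):
    """Add one record to the end of the file."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow(record)


def read_records(path):
    """Read every record as a dict keyed by field name (values are strings)."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))


def fetch_record(path, roll_no):
    """Fetch the record with the given roll number, or None if absent."""
    for rec in read_records(path):
        if rec["Roll No"] == str(roll_no):
            return rec
    return None


def delete_record(path, roll_no):
    """Delete a record by rewriting the file without it."""
    kept = [r for r in read_records(path) if r["Roll No"] != str(roll_no)]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(kept)


path = os.path.join(tempfile.mkdtemp(), "CPS703 Marksheet.csv")
create_file(path)
add_record(path, [1, "Ajay", 75, 84, 89])
add_record(path, [2, "Sunil", 83, 76, 90])
fetched = fetch_record(path, 2)
delete_record(path, 1)
remaining = [r["Name"] for r in read_records(path)]
os.remove(path)  # deleting the file itself
file_exists_after_delete = os.path.exists(path)
print(fetched, remaining, file_exists_after_delete)
```

Every value read back is a string. The file stores only symbols, and the knowledge that "Mid Sem-1" holds an integer lives in the program, not in the file.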
Do we need more than files?
• Files are the most basic form of persistent storage. Ultimately, every bit of persistent data is
written to a file for storage.
• The real question is whether a user needs a better abstraction for data manipulation than a
data file's “list of records composed of fields”.
• The file abstraction is adequate for a single application such as the student marks example.
When many applications each produce data of their own and also use data produced by other
applications, one data set may need to be related to another. This requires
abstractions that are more sophisticated than the basic file abstraction.
• We need to represent not only the raw data but also the relationships between various data
elements and data sets. This activity is called data modeling and its product is the data
model
Data Modelling
Most often the Physical Data Model is a relational database schema.
https://en.wikipedia.org/wiki/Data_modeling
An Introduction to Data Modeling
• Let us look at a practical introduction to data modeling by
actually carrying out this activity for a specific application
• The application we will choose is similar to the
CoWIN application
Conceptual Data Model
• Entities
• Attributes
• Relationships
• Integrity Rules
Entities
• Registered user – This is the person who creates the primary
account
• Beneficiaries – These are the people who actually receive the
vaccination
• Hospital
• Vaccinator
• Vaccine – This is the type of vaccine and its batch number
• Appointment
• Vaccination certificate
Attributes of Registered User
The attributes of a registered user are
• Mobile_number : string
Attributes of a Beneficiary
The attributes of beneficiary are
• Linked_Mobile_number : string
• Beneficiary_id : string
• First_name : string
• Last_name : string
• Middle_names : string
• Photo_id_type : string
• Photo_id_number : string
• Gender : character
• Year_of_birth : integer
Attributes of a Hospital
The attributes of a hospital are
• Hospital_id : string
• Hospital_name : string
• Hospital_state : string
• Hospital_district : string
• Hospital_pincode : string
Attributes of a Vaccinator
The attributes of a vaccinator are
• Linked_Mobile_number : string
• Vaccinator_id : string
• First_name : string
• Last_name : string
• Middle_names : string
• Photo_id_type : string
• Photo_id_number : string
• Gender : character
Attributes of a Vaccine
The attributes of a vaccine are
• Vaccine_name : string
• Manufacturer : string
• Batch_number : string
Attributes of an Appointment
If we assume that each Beneficiary_id is unique and each
Hospital_id is unique, then the attributes of an appointment
can be modeled as
• Beneficiary_id : string
• Hospital_id : string
• Date : date
• Time : time
Attributes of a Vaccination Certificate
The attributes of a Vaccination Certificate are
• Vaccination_certificate_id : string
• Certificate_type : character (this could be provisional or final)
• Beneficiary_id : string
• Beneficiary_name : string
• Beneficiary_gender : character
• Beneficiary_age : integer
• Vaccinator_id : string
• Vaccinator_name : string
• Hospital_id : string
• Hospital_address : string
• Date_of_vaccination : date
• Time_of_vaccination : time
• Vaccine_name : string
• Vaccine_batch_number : string
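Attribute lists like these translate directly into typed records. The sketch below shows two of the entities as Python dataclasses. The field names follow the attribute lists above in lowercase, single-character strings stand in for the character type, and all the sample values are invented for illustration.

```python
import datetime
from dataclasses import dataclass


@dataclass(frozen=True)
class Beneficiary:
    linked_mobile_number: str
    beneficiary_id: str
    first_name: str
    last_name: str
    middle_names: str
    photo_id_type: str
    photo_id_number: str
    gender: str          # a single character
    year_of_birth: int


@dataclass(frozen=True)
class Appointment:
    beneficiary_id: str  # refers to a Beneficiary
    hospital_id: str     # refers to a Hospital
    date: datetime.date
    time: datetime.time


# Made-up sample values, for illustration only.
ben = Beneficiary("9000000001", "B0001", "Renu", "Sharma", "",
                  "Passport", "X0000000", "F", 1990)
appt = Appointment(ben.beneficiary_id, "H0001",
                   datetime.date(2025, 8, 14), datetime.time(10, 30))
print(ben)
print(appt)
```

The Appointment record holds only the identifiers of the beneficiary and the hospital, not copies of their attributes. This is why the uniqueness of Beneficiary_id and Hospital_id matters.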
Relationships – Registered user to beneficiaries
Each registered user is associated with some beneficiaries
[Diagram: one Registered User linked to Beneficiary 1, Beneficiary 2, ..., Beneficiary n]
Relationships – Beneficiary to Appointment
Each beneficiary may book an appointment or reschedule an
appointment
[Diagram: one Beneficiary linked to Appointment 1, Appointment 2, ..., Appointment n]
Relationships – Vaccine to Hospital
Each batch of vaccine is allotted to several hospitals
[Diagram: one Vaccine linked to Hospital 1, Hospital 2, ..., Hospital n]
Relationships – Vaccinator to Hospital
Each vaccinator is associated with a hospital
[Diagram: one Vaccinator linked to Hospital 1, Hospital 2, ..., Hospital n]
Relationships – Vaccination Certificate to beneficiary to hospital to vaccine to vaccinator
[Diagram: a Vaccination Certificate linked to a Beneficiary, a Vaccine, a Vaccinator and a Hospital]
Integrity Rules
• Each registered user can enroll up to 4 beneficiaries including
themselves
• Each beneficiary can have at most one appointment confirmed at
any given time
• Each vaccinator can be associated with exactly one hospital at any
given time
• The vaccine batch number in a vaccination certificate must have
been allotted to the hospital mentioned in the vaccination
certificate
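Integrity rules become useful once they can be checked mechanically. The sketch below checks three of the rules over plain tuples. The function names, the tuple layouts and the "confirmed" status value are all assumptions made for the example, since the appointment attributes listed earlier do not include a status.

```python
from collections import Counter

MAX_BENEFICIARIES_PER_USER = 4


def users_over_beneficiary_limit(beneficiaries):
    """beneficiaries: iterable of (beneficiary_id, linked_mobile_number).
    Returns the mobile numbers that have more than 4 beneficiaries."""
    counts = Counter(mobile for _, mobile in beneficiaries)
    return sorted(m for m, n in counts.items()
                  if n > MAX_BENEFICIARIES_PER_USER)


def beneficiaries_with_multiple_confirmed(appointments):
    """appointments: iterable of (beneficiary_id, status).
    Returns beneficiaries with more than one confirmed appointment."""
    counts = Counter(b for b, status in appointments if status == "confirmed")
    return sorted(b for b, n in counts.items() if n > 1)


def certificates_with_unallotted_batch(certificates, allotments):
    """certificates: iterable of (certificate_id, hospital_id, batch_number).
    allotments: set of (batch_number, hospital_id) pairs.
    Returns certificates whose batch was never allotted to their hospital."""
    return [cid for cid, hospital, batch in certificates
            if (batch, hospital) not in allotments]


# Made-up sample data, for illustration only.
beneficiaries = [("B%d" % i, "9000000001") for i in range(1, 6)]
beneficiaries.append(("B6", "9000000002"))
appointments = [("B1", "confirmed"), ("B1", "confirmed"),
                ("B2", "confirmed"), ("B2", "cancelled")]
allotments = {("BATCH-1", "H1"), ("BATCH-1", "H2")}
certificates = [("C1", "H1", "BATCH-1"), ("C2", "H3", "BATCH-1")]

over_limit = users_over_beneficiary_limit(beneficiaries)
double_booked = beneficiaries_with_multiple_confirmed(appointments)
bad_certificates = certificates_with_unallotted_batch(certificates, allotments)
print(over_limit, double_booked, bad_certificates)
```

In a relational database some of these rules could instead be enforced by the schema itself, for example through keys and constraints. Checks like these are what remains when the storage layer cannot express a rule directly.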
Summary
• Data is represented by certain symbols which by themselves do not have any inherent
meaning
• In order to associate meaning with the symbols that are used to represent data, “structure”
has to be associated with these symbols. This ranges from standardized encodings for atomic
data at the lowest level, through hierarchical methods of describing the structure of composite
data, all the way to defining a “Data Model”
• The process of Data Modeling is broken up into two parts
– Creating the conceptual data model
– Creating the physical data model
• The conceptual model consists of describing the entities, their attributes, the relationships
amongst those entities and the integrity rules that the data set needs to satisfy
• Building the conceptual data model involves a study and careful analysis of the problem that
is sought to be solved