UNIT-1
UNIT I: Introduction to Data science, benefits and uses, facets of data, data science process in
brief, big data ecosystem and data science.
Data Science process: Overview, defining goals and creating project charter, retrieving data,
cleansing, integrating and transforming data, exploratory analysis, model building, presenting
findings and building applications on top of them.
Benefits and uses of data science and big data:
❖ Data science and big data are rapidly growing fields that offer a wide
range of benefits and uses across various industries. Some of the benefits
and uses of data science and big data are:
1. Improved decision-making: Data science and big data help
organizations make better decisions by analyzing and interpreting
large amounts of data. Data scientists can identify patterns,
trends, and insights that can be used to make informed decisions.
2. Increased efficiency: Data science and big data can help
organizations automate tasks, streamline processes, and optimize
operations. This can result in significant time and cost savings.
3. Personalization: With data science and big data, organizations
can personalize their products and services to meet the specific
needs and preferences of individual customers. This can lead to
increased customer satisfaction and loyalty.
4. Predictive analytics: Data science and big data can be used to
build predictive models that can forecast future trends and
behavior. This can be useful for businesses that need to anticipate
customer needs, market trends, or supply chain disruptions.
5. Fraud detection: Data science and big data can be used to detect
fraud and other types of financial crimes. By analyzing patterns
in financial data, data scientists can identify suspicious behavior
and prevent fraud.
6. Healthcare: Data science and big data can be used to improve
patient outcomes by analyzing large amounts of medical data.
This can lead to better diagnosis, treatment, and prevention of
diseases.
7. Marketing: Data science and big data can be used to improve
marketing strategies by analyzing consumer behavior and
preferences. This can help businesses target their marketing
campaigns more effectively and generate more leads and sales.
Facets of Data
• Very large amounts of data are generated in big data and data science. This data is of various types, and the main categories are as follows:
a) Structured
b) Natural language
c) Graph-based
d) Streaming
e) Unstructured
f) Machine-generated
g) Audio, video and images
Structured Data
• Structured data is arranged in a rows-and-columns format, which makes it easy for applications to retrieve and process it. Database management systems are used to store structured data.
• The term structured data refers to data that is identifiable because it is organized in a structure. The most common form of structured data, or records, is a database where specific information is stored in rows and columns.
• Structured data is also searchable by data type within content. Structured data is
understood by computers and is also efficiently organized for human readers.
• An Excel table is an example of structured data.
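As a minimal illustration (assuming the pandas library is available; the table contents are invented), structured data can be represented and queried as rows and columns:

```python
import pandas as pd

# A small table of structured data: every record has the same columns,
# so it could equally be stored in a relational database or an Excel sheet.
customers = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "name": ["Asha", "Ravi", "Meena"],
    "city": ["Hyderabad", "Chennai", "Pune"],
    "purchase_amount": [2500.0, 1200.5, 3999.9],
})

# Because the structure is known, records can be filtered by column and data type.
print(customers[customers["purchase_amount"] > 2000])
```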
Unstructured Data
• Unstructured data is data that does not follow a specified format. It is not organized into rows and columns, so it is difficult to retrieve the required information. Unstructured data has no identifiable structure.
• Unstructured data can be in the form of text (documents, email messages, customer feedback), audio, video and images. Email is an example of unstructured data.
• Even today, more than 80% of the data in most organizations is in unstructured form. It carries a lot of information, but extracting that information from these varied sources is a very big challenge.
• Characteristics of unstructured data:
1. There is no structural restriction or binding for the data.
2. Data can be of any type.
3. Unstructured data does not follow any structural rules.
4. There are no predefined formats, restrictions or sequences for unstructured data.
5. Since there is no structural binding for unstructured data, it is unpredictable in
nature.
Natural Language
• Natural language is a special type of unstructured data.
• Natural language processing enables machines to recognize characters, words
and sentences, then apply meaning and understanding to that information. This
helps machines to understand language as humans do.
• Natural language processing is the driving force behind machine intelligence in
many modern real-world applications. The natural language processing
community has had success in entity recognition, topic recognition,
summarization, text completion and sentiment analysis.
• For natural language processing to help machines understand human language, it must go through speech recognition, natural language understanding and machine translation. It is an iterative process composed of several layers of text analysis.
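As a small sketch of a basic text-analysis building block (using only the Python standard library on an invented sample passage, not a full NLP pipeline), the snippet below tokenizes text and counts word frequencies:

```python
import re
from collections import Counter

text = ("Natural language processing helps machines understand language. "
        "Machines learn patterns in language from large amounts of text.")

# Tokenize: lowercase the text and split it into words.
tokens = re.findall(r"[a-z']+", text.lower())

# Count word frequencies, a basic building block for topic and sentiment analysis.
word_counts = Counter(tokens)
print(word_counts.most_common(5))
```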
Machine - Generated Data
• Machine-generated data is information that is created without human interaction, as a result of a computer process or application activity. This means that data entered manually by an end user is not considered machine-generated.
• Machine data contains a definitive record of all activity and behavior of our
customers, users, transactions, applications, servers, networks, factory machinery
and so on.
• It includes configuration data, data from APIs and message queues, change events, the output of diagnostic commands, call detail records, sensor data from remote equipment and more.
• Examples of machine data are web server logs, call detail records, network event
logs and telemetry.
• Both Machine-to-Machine (M2M) and Human-to-Machine (H2M) interactions
generate machine data. Machine data is generated continuously by every
processor-based system, as well as many consumer-oriented systems.
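For illustration, the sketch below parses a single web server access-log line, a typical piece of machine-generated data; the log line and field layout are assumptions modelled on the common log format:

```python
import re

# One line of a (hypothetical) web server access log in common log format.
log_line = '192.168.1.10 - - [10/Jan/2024:13:55:36 +0530] "GET /index.html HTTP/1.1" 200 2326'

# Regular expression capturing the client IP, timestamp, request, status code and size.
pattern = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+)'
)

match = pattern.match(log_line)
if match:
    # Once parsed into named fields, machine-generated data becomes structured data.
    print(match.groupdict())
```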
Graph-based or Network Data
• Graphs are data structures that describe relationships and interactions between entities in complex systems. In general, a graph contains a collection of entities called nodes and another collection of interactions between pairs of nodes called edges.
• Nodes represent entities, which can be of any object type that is relevant to our
problem domain. By connecting nodes with edges, we will end up with a graph
(network) of nodes.
• A graph database stores nodes and relationships instead of tables or documents.
Data is stored just like we might sketch ideas on a whiteboard. Our data is stored
without restricting it to a predefined model, allowing a very flexible way of
thinking about and using it.
• Graph databases are used to store graph-based data and are queried with
specialized query languages such as SPARQL.
Figure 1: Friends in a social network are an example of graph-based data.
• Graph databases are capable of sophisticated fraud prevention. With graph
databases, we can use relationships to process financial and purchase transactions
in near-real time. With fast graph queries, we are able to detect that, for example,
a potential purchaser is using the same email address and credit card as included
in a known fraud case.
• Graph databases can also help users easily detect relationship patterns, such as multiple people associated with the same personal email address or multiple people sharing the same IP address but residing at different physical addresses.
• Graph databases are a good choice for recommendation applications. With graph
databases, we can store in a graph relationships between information categories
such as customer interests, friends and purchase history. We can use a highly
available graph database to make product recommendations to a user based on
which products are purchased by others who follow the same sport and have
similar purchase history.
• Graph theory was probably the main method of social network analysis in the early history of the social network concept. The approach is applied to social network analysis in order to determine important features of the network, such as its nodes and links (for example, influencers and their followers).
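The following minimal sketch (plain Python with invented names, not an actual graph database) stores a small social network as nodes and edges and derives friend-of-friend recommendations, the kind of relationship query described above:

```python
# A tiny social network: nodes are people, edges are friendships (stored as adjacency sets).
friends = {
    "Alice": {"Bob", "Carol"},
    "Bob": {"Alice", "David"},
    "Carol": {"Alice", "David"},
    "David": {"Bob", "Carol", "Eve"},
    "Eve": {"David"},
}

def friend_suggestions(person):
    """Suggest friends-of-friends who are not already direct friends."""
    direct = friends[person]
    suggestions = set()
    for friend in direct:
        suggestions |= friends[friend]
    return suggestions - direct - {person}

# Alice's friends-of-friends: David (reachable via both Bob and Carol).
print(friend_suggestions("Alice"))   # {'David'}
```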
Audio, Image and Video
• Audio, image and video are data types that pose specific challenges to a data
scientist. Tasks that are trivial for humans, such as recognizing objects in pictures,
turn out to be challenging for computers.
• The terms audio and video commonly refer to time-based media storage formats for sound/music and moving-picture information. Digital audio and video recordings, encoded with audio and video codecs, can be uncompressed, losslessly compressed or lossy compressed depending on the desired quality and use case.
• It is important to note that multimedia data is one of the most important sources of information and knowledge; the integration, transformation and indexing of multimedia data bring significant challenges to data management and analysis. Many challenges have to be addressed, including the scale of big data, the multidisciplinary nature of Data Science and the heterogeneity of the data.
• Data Science is playing an important role to address these challenges in
multimedia data. Multimedia data usually contains various forms of media, such
as text, image, video, geographic coordinates and even pulse waveforms, which
come from multiple sources. Data Science can be a key instrument covering big
data, machine learning and data mining solutions to store, handle and analyze
such heterogeneous data.
Streaming Data
• Streaming data is data that is generated continuously by thousands of data sources, which typically send the data records simultaneously and in small sizes (on the order of kilobytes).
• Streaming data includes a wide variety of data such as log files generated by
customers using your mobile or web applications, ecommerce purchases, in-game
player activity, information from social networks, financial trading floors or
geospatial services and telemetry from connected devices or instrumentation in
data centers.
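As a small, hedged sketch (pure Python with simulated telemetry records, rather than a real streaming platform), streaming data can be modelled as records that arrive one at a time and are processed incrementally:

```python
import random
import time

def sensor_stream(n_records):
    """Simulate a stream of small telemetry records arriving one by one."""
    for i in range(n_records):
        yield {"device_id": i % 3, "temperature": round(random.uniform(20, 30), 2)}
        time.sleep(0.1)  # records arrive continuously, not as one big batch

# Process each record as soon as it arrives, keeping a running average.
total, count = 0.0, 0
for record in sensor_stream(10):
    total += record["temperature"]
    count += 1
    print(f"device {record['device_id']}: running average = {total / count:.2f}")
```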
The data science process:
The data science process typically involves the following steps:
1. Define the problem: The first step in the data science process is to define
the problem that you want to solve. This involves identifying the business
or research question that you want to answer and determining what data
you need to collect.
2. Collect and clean the data: Once you have identified the data that you
need, you will need to collect and clean the data to ensure that it is
accurate and complete. This involves checking for errors, missing values,
and inconsistencies.
3. Explore and visualize the data: After you have collected and cleaned the
data, the next step is to explore and visualize the data. This involves
creating summary statistics, visualizations, and other descriptive analyses
to better understand the data.
4. Prepare the data: Once you have explored the data, you will need to
prepare the data for analysis. This involves transforming and
manipulating the data, creating new variables, and selecting relevant
features.
5. Build the model: With the data prepared, the next step is to build a model
that can answer the business or research question that you identified in
step one. This involves selecting an appropriate algorithm, training the
model, and evaluating its performance.
6. Evaluate the model: Once you have built the model, you will need to evaluate its performance to ensure that it is accurate and effective. This involves using metrics such as accuracy, precision, recall, and F1 score to assess the model's performance (a short sketch of steps 4 to 6 follows this list).
7. Deploy the model: After you have evaluated the model, the final step is to
deploy the model in a production environment. This involves integrating
the model into an application or workflow and ensuring that it can handle
real-world data and user inputs.
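As a compact, hedged sketch of steps 4 to 6 (assuming scikit-learn is installed and using its built-in Iris dataset purely for illustration), the snippet below prepares data, builds a model and evaluates it:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Step 4: prepare the data - split it and scale the features.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Step 5: build the model - select an algorithm and train it.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 6: evaluate the model with metrics such as accuracy and F1 score.
y_pred = model.predict(X_test)
print("accuracy:", accuracy_score(y_test, y_pred))
print("macro F1:", f1_score(y_test, y_pred, average="macro"))
```

In practice, the scaler and the model are often combined in a scikit-learn Pipeline so that the same preprocessing is applied when the model is deployed in step 7.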
The big data ecosystem and data science:
❖ The big data ecosystem and data science are closely related, as the former
provides the infrastructure and tools that enable the latter.
❖ The big data ecosystem refers to the set of technologies, platforms, and
frameworks that are used to store, process, and analyze large volumes of
data.
❖ Some of the key components of the big data ecosystem include:
1. Storage: Big data storage systems such as Hadoop Distributed File
System (HDFS), Apache Cassandra, and Amazon S3 are designed to
store and manage large volumes of data across multiple nodes.
2. Processing: Big data processing frameworks such as Apache Spark, Apache Flink, and Apache Storm are used to process and analyze large volumes of data in parallel across distributed computing clusters (a brief PySpark sketch follows this list).
3. Querying: Big data querying systems such as Apache Hive, Apache Pig,
and Apache Drill are used to extract and transform data stored in big data
storage systems.
4. Visualization: Big data visualization tools such as Tableau, D3.js, and
Apache Zeppelin are used to create interactive visualizations and
dashboards that enable data scientists and business analysts to explore
and understand data.
5. Machine learning: Big data machine learning platforms such as Apache
Mahout, TensorFlow, and Microsoft Azure Machine Learning are used to
build and deploy machine learning models at scale.
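As a brief, hedged sketch of the processing component (assuming a working PySpark installation; the event data is invented for illustration):

```python
from pyspark.sql import SparkSession

# Start a local Spark session; on a real cluster this would connect to the cluster manager.
spark = SparkSession.builder.appName("ecosystem-sketch").getOrCreate()

# A tiny in-memory DataFrame standing in for a large distributed dataset
# (in practice this would be read from HDFS, S3 or another big data store).
events = spark.createDataFrame(
    [("click", 1), ("view", 3), ("click", 2), ("purchase", 1)],
    ["event_type", "count"],
)

# A simple distributed aggregation: total count per event type.
events.groupBy("event_type").sum("count").show()

spark.stop()
```

Spark builds the aggregation lazily and only executes it across the available cores or cluster when show() is called.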
The data science process: Overview
Defining research goals and creating a project charter:
Defining research goals and creating a project charter are important initial
steps in any data science project, as they set the stage for the entire project and
help ensure that it stays focused and on track.
Here are some key considerations for defining research goals and creating
a project charter in data science:
❖ Identify the problem or question you want to answer: What is the business
problem or research question that you are trying to solve? It's important to
clearly define the problem or question at the outset of the project, so that
everyone involved is on the same page and working towards the same goal.
❖ Define the scope of the project: Once you have identified the problem or
question, you need to define the scope of the project. This includes
specifying the data sources you will be using, the variables you will be
analyzing, and the timeframe for the project.
❖ Determine the project objectives: What do you hope to achieve with the
project? What are your key performance indicators (KPIs)? This will help
you measure the success of the project and determine whether you have
achieved your goals.
❖ Identify the stakeholders: Who are the key stakeholders in the project?
This could include business leaders, data analysts, data scientists, and
other team members. It's important to identify all the stakeholders upfront
so that everyone is aware of their role in the project and can work
together effectively.
❖ Create a project charter: The project charter is a document that
summarizes the key information about the project, including the problem
or question, the scope of the project, the objectives, the stakeholders, and
any constraints or risks. It's a critical document that helps ensure
everyone involved in the project is on the same page and understands
what is expected of them.
Retrieving data:
Retrieving data is an essential step in the data science process, as it provides the raw material needed to analyze and derive insights. There are various ways to retrieve data, and the methods used depend on the type of data and where it is stored.
Here are some common methods for retrieving data in data science (a short sketch follows the list):
➢ File import: Data can be retrieved from files in various formats, such as
CSV, Excel, JSON, or XML. This is a common method used to retrieve
data that is stored locally.
➢ Web scraping: Web scraping involves using scripts to extract data from
websites. This is a useful method for retrieving data that is not readily
available in a structured format.
➢ APIs: Many applications and services provide APIs (Application
Programming Interfaces) that allow data to be retrieved
programmatically. APIs can be used to retrieve data from social media
platforms, weather services, financial data providers, and many other
sources.
➢ Databases: Data is often stored in databases, and SQL (Structured Query
Language) can be used to retrieve data from databases. Non-relational
databases such as MongoDB or Cassandra are also popular for storing
and retrieving data.
➢ Big Data platforms: When dealing with large amounts of data, big data
platforms such as Hadoop, Spark, or NoSQL databases can be used to
retrieve data efficiently.
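A minimal sketch of two of these methods, file import and a database query, is shown below; the file name, table and columns are invented, and the SQLite database is created in memory so the example stays self-contained:

```python
import sqlite3
import pandas as pd

# 1. File import: read a (hypothetical) local CSV file into a DataFrame.
# sales_from_file = pd.read_csv("sales.csv")   # commented out: no such file ships with these notes

# 2. Database retrieval: query a small in-memory SQLite database with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [("pen", "North", 120.0), ("book", "South", 340.5), ("pen", "South", 95.0)],
)

sales = pd.read_sql_query(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region", conn)
print(sales)
conn.close()
```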
Cleansing, integrating and transforming data
Cleansing, integrating, and transforming data are essential steps in the data preparation process in data science. These steps are necessary to ensure that the data is accurate, consistent, and usable for analysis. Here's an overview of each step, followed by a short pandas sketch:
▪ Data Cleansing: This step involves identifying and correcting or
removing any errors, inconsistencies, or missing values in the data. Some
common techniques used for data cleansing include removing duplicates,
filling in missing values, correcting spelling errors, and dealing with
outliers.
▪ Data Integration: In many cases, data comes from multiple sources, and
data integration is needed to combine the data into a single dataset. This
can involve matching and merging datasets based on common fields or
keys, and handling any discrepancies or inconsistencies between the
datasets.
Figure 1: Joining two tables on the Item and Region keys
▪ Data Transformation: Data transformation involves converting the data
into a format that is more suitable for analysis. This can involve
converting categorical variables into numerical variables, scaling or
normalizing data, and creating new variables or features from existing
data.
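A short pandas sketch of all three steps, using two tiny invented tables (joined here on the Item key only, for brevity):

```python
import pandas as pd

# Two small tables to be integrated on a common key.
sales = pd.DataFrame({
    "Item": ["pen", "pen", "book", "book"],
    "Region": ["North", "North", "South", "East"],
    "Units": [10, 10, None, 7],            # contains a duplicate row and a missing value
})
prices = pd.DataFrame({
    "Item": ["pen", "book"],
    "Price": [12.0, 250.0],
})

# Data cleansing: remove duplicate rows and fill the missing value with the mean.
sales = sales.drop_duplicates().reset_index(drop=True)
sales["Units"] = sales["Units"].fillna(sales["Units"].mean())

# Data integration: merge (join) the two tables on the common Item key.
merged = sales.merge(prices, on="Item", how="left")

# Data transformation: create a new feature and encode the categorical Region column.
merged["Revenue"] = merged["Units"] * merged["Price"]
merged = pd.get_dummies(merged, columns=["Region"])
print(merged)
```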
Exploratory data analysis:
Exploratory data analysis (EDA) is the process of analyzing and
summarizing data sets in order to gain insights and identify patterns.
The main goal of EDA is to understand the data, rather than to test a
particular hypothesis. The process typically involves visualizing the data
using graphs, charts, and tables, as well as calculating summary statistics
such as mean, median, and standard deviation.
Figure 2: From top to bottom, a bar chart, a line plot, and a distribution are some of the graphs
used in exploratory analysis.
Some common techniques used in EDA include the following (a brief pandas sketch follows the list):
❖ Descriptive statistics: This involves calculating summary statistics such
as mean, median, mode, standard deviation, and range.
❖ Data visualization: This involves creating graphs, charts, and other
visual representations of the data, such as histograms, scatter plots, and
box plots.
❖ Data transformation: This involves transforming the data to make it
easier to analyze, such as normalizing or standardizing the data, or log
transforming skewed data.
❖ Outlier detection: This involves identifying and analyzing data points
that are significantly different from the other data points.
❖ Correlation analysis: This involves examining the relationship between
different variables in the data set, such as calculating correlation
coefficients or creating correlation matrices.
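A brief pandas sketch of descriptive statistics, outlier detection and correlation analysis (the dataset is randomly generated purely for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.integers(18, 65, size=200),
    "income": rng.normal(50000, 12000, size=200).round(2),
})

# Descriptive statistics: mean, standard deviation, quartiles, etc.
print(df.describe())

# Outlier detection: flag incomes more than 3 standard deviations from the mean.
z_scores = (df["income"] - df["income"].mean()) / df["income"].std()
print(df[z_scores.abs() > 3])

# Correlation analysis: correlation matrix between the variables.
print(df.corr())

# Data visualization (e.g., a histogram with df["income"].hist()) could follow.
```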
Overall, EDA is an important step in any data analysis project, as it helps to
identify any patterns, outliers, or other trends in the data that may be relevant to
the analysis. It also helps to ensure that the data is clean, complete, and ready
for further analysis.
Model building
• To build a model, the data should be clean and its content properly understood. The components of model building are as follows:
a) Selection of model and variable
b) Execution of model
c) Model diagnostics and model comparison
Model and Variable Selection
• For this phase, consider model performance and whether the project meets all the requirements to use the model, as well as other factors:
1. Must the model be moved to a production environment and, if so, would it be
easy to implement?
2. How difficult is the maintenance on the model: how long will it remain relevant
if left untouched?
3. Does the model need to be easy to explain?
Model Execution
• Various programming languages can be used to implement the model. For model execution, Python provides libraries such as StatsModels and Scikit-learn. These packages implement several of the most popular techniques.
• Coding a model is a nontrivial task in most cases, so having these libraries available can speed up the process. The following are remarks on the model output:
a) Model fit: R-squared or adjusted R-squared is used to measure how well the model fits the data.
b) Predictor variables have a coefficient: For a linear model this is easy to
interpret.
c) Predictor significance: Coefficients are great, but sometimes not enough
evidence exists to show that the influence is there.
• Linear regression works if we want to predict a value, but to classify something, classification models are used. The k-nearest neighbors method is one of the simplest and most widely used classification methods.
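A minimal, hedged sketch of model execution (synthetic data and invented variable names), showing a StatsModels linear regression whose summary reports the fit, coefficients and significance noted in remarks a) to c), followed by a small k-nearest neighbors classifier:

```python
import numpy as np
import statsmodels.api as sm
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)

# --- Regression: predict a value with an ordinary least squares linear model ---
x = rng.uniform(0, 10, size=100)
y = 3.0 * x + 5.0 + rng.normal(0, 2.0, size=100)   # synthetic linear relationship plus noise
X = sm.add_constant(x)                             # add an intercept term explicitly
ols_model = sm.OLS(y, X).fit()
# The summary reports the model fit (R-squared), the predictor coefficients and
# their significance (p-values), corresponding to remarks a), b) and c) above.
print(ols_model.summary())

# --- Classification: k-nearest neighbors on a tiny two-class toy dataset ---
features = np.array([[1, 1], [1, 2], [2, 1], [8, 8], [8, 9], [9, 8]])
labels = np.array([0, 0, 0, 1, 1, 1])
knn = KNeighborsClassifier(n_neighbors=3).fit(features, labels)
print(knn.predict([[2, 2], [9, 9]]))               # expected: [0 1]
```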
The following commercial tools are used:
1. SAS Enterprise Miner: This tool allows users to run predictive and descriptive models based on large volumes of data from across the enterprise.
2. SPSS Modeler: It offers methods to explore and analyze data through a GUI.
3. Matlab: Provides a high-level language for performing a variety of data analytics, algorithms and data exploration.
4. Alpine Miner: This tool provides a GUI front end for users to develop analytic workflows and interact with Big Data tools and platforms on the back end.
Open Source tools:
1. R and PL/R: PL/R is a procedural language for PostgreSQL with R.
2. Octave: A free software programming language for computational modeling that has some of the functionality of Matlab.
3. WEKA: It is a free data mining software package with an analytic workbench. The functions created in WEKA can be executed within Java code.
4. Python: A programming language that provides toolkits for machine learning and analysis.
5. SQL in-database implementations, such as MADlib, provide an alternative to in-memory desktop analytical tools.
Model Diagnostics and Model Comparison
• Try to build multiple models and then select the best one based on multiple criteria.
• In the holdout method, the data is split into two different datasets, labelled as a training dataset and a testing dataset. The split can be 60/40, 70/30 or 80/20. This technique is called the hold-out validation technique.
Suppose we have a database with house prices as the dependent variable and two
independent variables showing the square footage of the house and the number of
rooms. Now, imagine this dataset has 30 rows. The whole idea is that you build a
model that can predict house prices accurately.
• To 'train' our model, or to see how well it performs, we randomly subset 20 of those rows and fit the model on them. The second step is to predict the values of the 10 rows that we excluded and measure how good our predictions are.
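A short sketch of this holdout split using scikit-learn; the 30-row house-price dataset is synthetic and generated only for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# A synthetic 30-row dataset: square footage and number of rooms -> house price.
rng = np.random.default_rng(7)
sqft = rng.uniform(500, 2500, size=30)
rooms = rng.integers(1, 6, size=30)
price = 150 * sqft + 10000 * rooms + rng.normal(0, 20000, size=30)

X = np.column_stack([sqft, rooms])

# Hold out 10 of the 30 rows for testing; fit the model on the other 20.
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=10, random_state=42)

model = LinearRegression().fit(X_train, y_train)

# Measure how good the predictions on the held-out rows are.
predictions = model.predict(X_test)
print("mean absolute error:", mean_absolute_error(y_test, predictions))
```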
Presenting Findings and Building Applications
• The team delivers final reports, briefings, code and technical documents.
• In addition, the team may run a pilot project to implement the models in a production environment.
• The last stage of the data science process is where your soft skills will be most useful.
• This involves presenting your results to the stakeholders and industrializing your analysis process for repetitive reuse and integration with other tools.