03-Data Science Methodology
Analytic Approach: The process of selecting the appropriate method or path to address a specific data science question or problem.
Business Understanding: The initial phase of data science methodology, which involves seeking clarification and understanding the goals, objectives, and requirements of a given task or problem.
Clustering Association: An approach used to learn about human behavior and identify patterns and associations in data.
Congestive Heart Failure (CHF): A chronic condition in which the heart cannot pump enough blood to meet the body's needs, resulting in fluid buildup and symptoms such as shortness of breath and fatigue.
CRISP-DM: Cross-Industry Standard Process for Data Mining, a widely used methodology for data mining and analytics projects encompassing six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
Data science methodology: A structured approach to solving business problems using data analysis and data-driven insights.
Data scientist: A professional using scientific methods, algorithms, and tools to analyze data, extract insights, and develop models or solutions to complex business problems.
Data scientists: Professionals with data science and analytics expertise who apply their skills to solve business problems.
Data-Driven Insights: Insights derived from analyzing and interpreting data to inform decision-making.
Decision Tree Classification Model: A model that uses a tree-like structure to classify data based on conditions and thresholds, providing predicted outcomes and associated probabilities.
Decision Tree Classifier: A classification model that uses a decision tree to determine outcomes based on specific conditions and thresholds.
Decision-Tree Model: A model used to review scenarios and identify relationships in data, such as the reasons for patient readmissions.
Descriptive approach: An approach used to show relationships and identify clusters of similar activities based on events and preferences.
Descriptive modeling: A modeling technique that focuses on describing and summarizing data, often through statistical analysis and visualization, without making predictions or inferences.
Domain knowledge: Expertise and understanding of a specific subject area or field, including its concepts, principles, and relevant data.
Goals and objectives: The sought-after outcomes and specific objectives that support the overall goal of the task or problem.
Iterative process: A process that involves repeating a series of steps or actions to refine and improve a solution or analysis. Each iteration builds upon the previous one.
Leaf: The final nodes of a decision tree where data is categorized into specific outcomes.
Machine Learning: A field of study that enables computers to learn from data without being explicitly programmed, identifying hidden relationships and trends.
Mean: The average value of a set of numbers, calculated by summing all the values and dividing by the total number of values.
Model (Conceptual model): A simplified representation or abstraction of a real-world system or phenomenon used to understand, analyze, or predict its behavior.
Model building: The process of developing predictive models to gain insights and make informed decisions based on data analysis.
Pairwise comparison (correlation): A statistical technique that measures the strength and direction of the linear relationship between two variables by calculating a correlation coefficient.
model data.
Predictors: Variables or features in a model that are used to predict or explain the outcome variable or target variable.
Prioritization: The process of organizing objectives and tasks based on their importance and impact on the overall goal.
Problem solving: The process of addressing challenges and finding solutions to achieve desired outcomes.
Stakeholders: Individuals or groups with a vested interest in the data science model's outcome and its practical application, such as solution owners, marketing, application developers, and IT administration.
Standard deviation: A measure of the dispersion or variability of a set of values from their mean; it provides information about the spread or distribution of the data.
Statistical analysis: Statistical methods applied to problems that require counts, such as yes/no answers or classification tasks.
Structured data (data model): Data that is organized and formatted according to a predefined schema or model and is typically stored in databases or spreadsheets.
Threshold value: The specific value used to split data into groups or categories in a decision tree.
Analytics team: A group of professionals, including data scientists and analysts, responsible for performing data analysis and modeling.
Data collection: The process of gathering data from various sources, including demographic, clinical, coverage, and pharmaceutical information.
Data integration: The merging of data from multiple sources to remove redundancy and prepare it for further analysis.
Data Preparation: The process of organizing and formatting data to meet the requirements of the modeling technique.
Data Requirements: The identification and definition of the necessary data elements, formats, and sources required for analysis.
Data Understanding: A stage where data scientists discuss various ways to manage data effectively, including automating certain processes in the database.
DBAs (Database Administrators): The professionals who are responsible for managing and extracting data from databases.
Decision tree classification: A modeling technique that uses a tree-like structure to classify data based on specific conditions and variables.
Demographic information: Information about patient characteristics, such as age, gender, and location.
Descriptive statistics: Techniques used to analyze and summarize data, providing initial insights and identifying gaps in data.
Intermediate results: Partial results obtained from predictive modeling that can influence decisions on acquiring additional data.
Patient cohort: A group of patients with specific criteria selected for analysis in a study or model.
Predictive modeling: The building of models to predict future outcomes based on historical data.
Training set: A subset of data used to train or fit a machine learning model; consists of input data and corresponding known or labeled output values.
Unavailable data: Data elements that are not currently accessible or integrated into the data sources.
Unstructured data: Data that does not have a predefined structure or format, typically text, images, audio, or video, and that requires special techniques to extract meaning or insights.
Visualization: The process of representing data visually to gain insights into its content and quality.
Automation: Using tools and techniques to streamline data collection and preparation processes.
Data Collection: The phase of gathering and assembling data from various sources.
Data Compilation: The process of organizing and structuring data to create a comprehensive data set.
Data Formatting: The process of standardizing the data to ensure uniformity and ease of analysis.
Data Manipulation: The process of transforming data into a usable format.
Data Preparation: The phase where data is cleaned, transformed, and formatted for further analysis, including feature engineering and text analysis.
Data Preparation: The stage where data is transformed and organized to facilitate effective analysis and modeling.
Data Quality Assessment: The evaluation of data integrity, accuracy, and completeness.
Data Understanding: The stage in the data science methodology focused on exploring and analyzing the collected data to ensure that the data is representative of the problem to be solved.
Descriptive Statistics: Summary statistics that data scientists use to describe and understand the distribution of variables, such as mean, median, minimum, maximum, and standard deviation.
Feature: A characteristic or attribute within the data that helps in solving the problem.
Feature Engineering: The process of creating new features or variables based on domain knowledge to improve machine learning algorithms' performance.
Feature Extraction: Identifying and selecting relevant features or attributes from the data set.
Interactive Processes: Iterative and continuous refinement of the methodology based on insights and feedback from data analysis.
Missing Values: Values that are absent or unknown in the dataset, requiring careful handling during data preparation.
Model Calibration: Adjusting model parameters to improve accuracy and alignment with the initial design.
Pairwise Correlations: An analysis to determine the relationships and correlations between different variables.
Text Analysis: Steps to analyze and manipulate textual data, extracting meaningful information and patterns.
Text Analysis Groupings: Creating meaningful groupings and categories from textual data for analysis.
Visualization techniques: Methods and tools that data scientists use to create visual representations or graphics that enhance the accessibility and understanding of data patterns, relationships, and insights.
From Modeling to Evaluation
Term Definition
Binary classification model: A model that classifies data into two categories, such as yes/no or stop/go outcomes.
Data compilation: The process of gathering and organizing data required for modeling.
Data modeling: The stage in the data science methodology where data scientists develop models, either descriptive or predictive, to answer specific questions.
Descriptive model: A type of model that examines relationships between variables and makes inferences based on observed patterns.
Diagnostic measure based tuning: The process of fine-tuning the model by adjusting parameters based on diagnostic measures and performance indicators.
Diagnostic measures: The evaluation of a model's performance to ensure that the model functions as intended.
Discrimination criterion: A measure used to evaluate the performance of the model in classifying different outcomes.
False-positive rate: The rate at which the model incorrectly identifies negative outcomes as positive.
Maximum separation: The point where the ROC curve provides the best discrimination between true-positive and false-positive rates, indicating the most effective model.
Model evaluation: The process of assessing the quality and relevance of the model before deployment.
Optimal model: The model that provides the maximum separation between the ROC curve and the baseline, indicating higher accuracy and effectiveness.
Receiver Operating Characteristic (ROC): A statistical curve, originally developed for military radar, used to assess the performance of binary classification models.
Relative misclassification cost: A parameter in model building used to tune the trade-off between true-positive and false-positive rates.
ROC curve (Receiver Operating Characteristic curve): A diagnostic tool used to determine the optimal classification model's performance.
Statistical significance testing: Evaluation technique to verify that data is appropriately handled and interpreted within the model.
True-positive rate: The rate at which the model correctly identifies positive outcomes.
The primary goal of the Business Understanding stage is to understand the business problem
and determine the data needed to answer the core business question.
During the Analytic Approach stage, you can choose from descriptive, diagnostic, predictive,
and prescriptive analytic approaches and decide whether to use machine learning techniques.
During the Data Requirements stage, scientists identify the correct and necessary data content,
formats, and sources needed for the specific analytical approach.
During the Data Collection stage, expert data scientists revise data requirements and make
critical decisions regarding the quantity and quality of data. Data scientists apply descriptive
statistics and visualization techniques to thoroughly assess the content, quality, and initial
insights gained from the collected data, identify gaps, and determine if new data is needed, or
if they should substitute existing data.
The Data Understanding stage encompasses all activities related to constructing the data set.
This stage answers the question of whether the collected data represents the data needed to
solve the business problem. Data scientists might use descriptive statistics, predictive
statistics, or both.
Data scientists commonly apply univariate statistics such as mean, median, minimum, maximum,
and standard deviation, along with pairwise correlations and histograms.
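The statistics named above can be sketched with the Python standard library alone. This is a minimal illustration, not the course's own tooling: the variable names (ages, prior_admissions) and all values are hypothetical, the correlation shown is the Pearson coefficient, and the histogram is a simple count per ten-year bin.

```python
import statistics
from collections import Counter

def summarize(values):
    """Univariate summary of one numeric variable."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences.

    Assumes both variables vary; a constant variable would divide by zero.
    """
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical patient data, invented for illustration only.
ages = [54, 61, 67, 72, 80]
prior_admissions = [0, 1, 1, 3, 4]

summary = summarize(ages)
r = pearson(ages, prior_admissions)

# Text histogram: number of patients per ten-year age bin.
histogram = Counter(age // 10 * 10 for age in ages)

print(summary)
print(f"pairwise correlation (age vs. prior admissions): {r:.3f}")
for bin_start in sorted(histogram):
    print(f"{bin_start}-{bin_start + 9}: {'#' * histogram[bin_start]}")
```

A correlation close to 1 or -1 between two predictors suggests they carry overlapping information, which is one reason to check pairwise correlations before modeling.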
During the Data Preparation stage, data scientists must address missing or invalid values,
remove duplicates, and validate that the data is properly formatted. Feature engineering and
text analysis are key techniques data scientists apply to validate and analyze data during the
Data Preparation stage.
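The cleanup steps described above might look like the following sketch. Everything in it is an assumption made for illustration: the record layout, the field names, the rule that a repeated patient_id marks a duplicate, and the choice of median imputation for missing or invalid ages. A real project would pick these rules with domain experts.

```python
import statistics

# Hypothetical raw records, invented for illustration only.
raw_records = [
    {"patient_id": "P1", "age": "54", "length_of_stay": "3"},
    {"patient_id": "P2", "age": "", "length_of_stay": "5"},         # missing age
    {"patient_id": "P1", "age": "54", "length_of_stay": "3"},       # duplicate
    {"patient_id": "P3", "age": "seventy", "length_of_stay": "2"},  # invalid age
    {"patient_id": "P4", "age": "80", "length_of_stay": "7"},
]

def parse_int(text):
    """Return an int, or None when the value is missing or malformed."""
    try:
        return int(text)
    except (TypeError, ValueError):
        return None

def prepare(records):
    seen = set()
    cleaned = []
    for rec in records:
        key = rec["patient_id"]
        if key in seen:  # remove duplicates (assumed rule: same patient_id)
            continue
        seen.add(key)
        cleaned.append({
            "patient_id": key,
            "age": parse_int(rec["age"]),
            "length_of_stay": parse_int(rec["length_of_stay"]),
        })

    # Impute missing or invalid ages with the median of the known ages.
    known_ages = [r["age"] for r in cleaned if r["age"] is not None]
    fill = statistics.median(known_ages)
    for r in cleaned:
        # Simple engineered feature: remember which ages were imputed.
        r["age_was_missing"] = r["age"] is None
        if r["age"] is None:
            r["age"] = fill
    return cleaned

prepared = prepare(raw_records)
for row in prepared:
    print(row)
```

Keeping the age_was_missing flag lets a later model learn whether missingness itself is informative, instead of hiding it behind the imputed value.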
The end goal of the Modeling stage is that the data model answers the business question.
During the Modeling stage, data scientists use a training data set. Data scientists test multiple
algorithms on the training set data to determine whether the variables are required and
whether the data supports answering the business question. The outcome of those models is
either descriptive or predictive.
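To make the idea of fitting a model to a training set concrete, here is a deliberately small sketch: a one-split decision tree (a decision stump) on a single predictor. It is a stand-in for a full decision tree classifier, not the method used in the course, and the training data (prior_admissions, readmitted) is hypothetical.

```python
def fit_stump(xs, ys):
    """Find the threshold on one numeric predictor that best separates two classes.

    The stump predicts 1 when x > threshold, else 0.
    Returns (threshold, training_accuracy). Assumes xs has at least two
    distinct values; otherwise no split exists and threshold is None.
    """
    best_threshold, best_correct = None, -1
    values = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct values.
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2
        correct = sum((x > t) == bool(y) for x, y in zip(xs, ys))
        if correct > best_correct:
            best_threshold, best_correct = t, correct
    return best_threshold, best_correct / len(xs)

def predict(threshold, x):
    """Apply the fitted stump to a new value of the predictor."""
    return 1 if x > threshold else 0

# Hypothetical training set: one predictor and a known yes/no outcome.
prior_admissions = [0, 0, 1, 1, 2, 3, 4, 5]
readmitted = [0, 0, 0, 1, 0, 1, 1, 1]

threshold, accuracy = fit_stump(prior_admissions, readmitted)
print(f"split at prior_admissions > {threshold}, training accuracy {accuracy:.3f}")
```

Accuracy on the training set tends to be optimistic; the Evaluation stage exists partly to check the model against criteria beyond how well it fits the data it was trained on.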
The Evaluation stage consists of two phases: the diagnostic measures phase and the statistical
significance phase. Data scientists and others assess the quality of the model and determine if
the model answers the initial Business Understanding question or if the data model needs
adjustment.
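One diagnostic measure from the glossary, the ROC curve, can be sketched as follows. The code computes the true-positive and false-positive rates at each cut-off and picks the cut-off with the largest gap between the curve and the diagonal baseline (TPR minus FPR, also known as Youden's J statistic). The scores and labels are hypothetical, and using this particular criterion to select the operating point is an assumption, not something the source prescribes.

```python
def roc_points(scores, labels):
    """Return (threshold, fpr, tpr) for each distinct score used as a cut-off.

    A case is predicted positive when its score is >= the threshold.
    Assumes labels contain at least one positive (1) and one negative (0).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, fp / negatives, tp / positives))
    return points

# Hypothetical model scores and true outcomes, invented for illustration.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

points = roc_points(scores, labels)

# Maximum separation from the baseline (the diagonal where TPR == FPR).
best = max(points, key=lambda p: p[2] - p[1])

for t, fpr, tpr in points:
    print(f"threshold {t:.1f}: FPR {fpr:.2f}, TPR {tpr:.2f}")
print("maximum separation at threshold", best[0])
```

Comparing this maximum separation across models fitted with different relative misclassification costs is one way to choose among them.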
During the Deployment stage, data scientists release the data model to a targeted group of
stakeholders, including solution owners, marketing staff, application developers, and IT
administration.
During the Feedback stage, stakeholders and users evaluate the model and contribute
feedback to assess the model’s performance.
The data model’s value depends on its ability to iterate; that is, how successfully the data
model incorporates user feedback.