03-Data Science Methodology

The document defines various terms related to data science and analytics. It provides definitions for analytic approach, analytics, business understanding, clustering, cohort, cohort study, congestive heart failure, CRISP-DM, data analysis, data cleansing, data science, data science methodology, data scientist, data-driven insights, decision tree, decision tree classification model, decision tree classifier, decision tree model, descriptive approach, descriptive modeling, domain knowledge, goals and objectives, iteration, iterative process, leaf, machine learning, mean, median, model, model building, pairwise comparison, pattern, predictive model, predictors, prioritization, problem solving, stakeholders, standard deviation, statistical analysis, statistics, structured data, text analysis, threshold value,

Uploaded by

abdessalemdjoudi

Term Definition

Analytic Approach: The process of selecting the appropriate method or path to address a specific data science question or problem.

Analytics: The systematic analysis of data using statistical, mathematical, and computational techniques to uncover insights, patterns, and trends.

Business Understanding: The initial phase of data science methodology, which involves seeking clarification and understanding the goals, objectives, and requirements of a given task or problem.

Clustering Association: An approach used to learn about human behavior and identify patterns and associations in data.

Cohort: A group of individuals who share a common characteristic or experience and are studied or analyzed as a unit.

Cohort study: An observational study where a group of individuals with a specific characteristic or exposure is followed over time to determine the incidence of outcomes or the relationship between exposures and outcomes.

Congestive Heart Failure (CHF): A chronic condition in which the heart cannot pump enough blood to meet the body's needs, resulting in fluid buildup and symptoms such as shortness of breath and fatigue.

CRISP-DM: Cross-Industry Standard Process for Data Mining, a widely used methodology for data mining and analytics projects encompassing six phases: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.

Data analysis: The process of inspecting, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making.

Data cleansing: The process of identifying and correcting or removing errors, inconsistencies, or inaccuracies in a dataset to improve its quality and reliability.

Data science: An interdisciplinary field that combines scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.

Data science methodology: A structured approach to solving business problems using data analysis and data-driven insights.

Data scientist: A professional using scientific methods, algorithms, and tools to analyze data, extract insights, and develop models or solutions to complex business problems.

Data scientists: Professionals with data science and analytics expertise who apply their skills to solve business problems.

Data-Driven Insights: Insights derived from analyzing and interpreting data to inform decision-making.

Decision tree: A supervised machine learning algorithm that uses a tree-like structure of decisions and their possible consequences to make predictions or classify instances.

Decision Tree Classification Model: A model that uses a tree-like structure to classify data based on conditions and thresholds, providing predicted outcomes and associated probabilities.

Decision Tree Classifier: A classification model that uses a decision tree to determine outcomes based on specific conditions and thresholds.

Decision-Tree Model: A model used to review scenarios and identify relationships in data, such as the reasons for patient readmissions.

Descriptive approach: An approach used to show relationships and identify clusters of similar activities based on events and preferences.

Descriptive modeling: A modeling technique that focuses on describing and summarizing data, often through statistical analysis and visualization, without making predictions or inferences.

Domain knowledge: Expertise and understanding of a specific subject area or field, including its concepts, principles, and relevant data.

Goals and objectives: The sought-after outcomes and specific objectives that support the overall goal of the task or problem.

Iteration: A single cycle or repetition of a process, often involving refining or modifying a solution based on feedback or new information.

Iterative process: A process that involves repeating a series of steps or actions to refine and improve a solution or analysis. Each iteration builds upon the previous one.

Leaf: The final nodes of a decision tree where data is categorized into specific outcomes.

Machine Learning: A field of study that enables computers to learn from data without being explicitly programmed, identifying hidden relationships and trends.

Mean: The average value of a set of numbers, calculated by summing all the values and dividing by the total number of values.

Median: The middle value in a set of numbers arranged in ascending or descending order; it divides the data into two equal halves.

Model (Conceptual model): A simplified representation or abstraction of a real-world system or phenomenon used to understand, analyze, or predict its behavior.

Model building: The process of developing predictive models to gain insights and make informed decisions based on data analysis.

Pairwise comparison (correlation): A statistical technique that measures the strength and direction of the linear relationship between two variables by calculating a correlation coefficient.

Pattern: A recurring or noticeable arrangement or sequence in data that can provide insights or be used for prediction or classification.

Predictive model: A model used to determine probabilities of an action or outcome based on historical data.

Predictors: Variables or features in a model that are used to predict or explain the outcome variable or target variable.

Prioritization: The process of organizing objectives and tasks based on their importance and impact on the overall goal.

Problem solving: The process of addressing challenges and finding solutions to achieve desired outcomes.

Stakeholders: Individuals or groups with a vested interest in the data science model's outcome and its practical application, such as solution owners, marketing, application developers, and IT administration.

Standard deviation: A measure of the dispersion or variability of a set of values from their mean; it provides information about the spread or distribution of the data.

Statistical analysis: Statistical methods applied to problems that require counts, such as yes/no answers or classification tasks.

Statistics: The collection, analysis, interpretation, presentation, and organization of data to understand patterns, relationships, and variability in the data.

Structured data (data model): Data that is organized and formatted according to a predefined schema or model and is typically stored in databases or spreadsheets.

Text analysis data mining: The process of extracting useful information or knowledge from unstructured textual data through techniques such as natural language processing, text mining, and sentiment analysis.

Threshold value: The specific value used to split data into groups or categories in a decision tree.

Analytics team: A group of professionals, including data scientists and analysts, responsible for performing data analysis and modeling.

Data collection: The process of gathering data from various sources, including demographic, clinical, coverage, and pharmaceutical information.

Data integration: The merging of data from multiple sources to remove redundancy and prepare it for further analysis.

Data Preparation: The process of organizing and formatting data to meet the requirements of the modeling technique.

Data Requirements: The identification and definition of the necessary data elements, formats, and sources required for analysis.

Data Understanding: A stage where data scientists discuss various ways to manage data effectively, including automating certain processes in the database.

DBAs (Database Administrators): The professionals who are responsible for managing and extracting data from databases.

Decision tree classification: A modeling technique that uses a tree-like structure to classify data based on specific conditions and variables.

Demographic information: Information about patient characteristics, such as age, gender, and location.

Descriptive statistics: Techniques used to analyze and summarize data, providing initial insights and identifying gaps in data.

Intermediate results: Partial results obtained from predictive modeling that can influence decisions on acquiring additional data.

Patient cohort: A group of patients with specific criteria selected for analysis in a study or model.

Predictive modeling: The building of models to predict future outcomes based on historical data.

Training set: A subset of data used to train or fit a machine learning model; consists of input data and corresponding known or labeled output values.

Unavailable data: Data elements that are not currently accessible or integrated into the data sources.

Univariate: Modeling analysis focused on a single variable or feature at a time, considering its characteristics and relationship to other variables independently.

Unstructured data: Data that does not have a predefined structure or format, typically text, images, audio, or video, and that requires special techniques to extract meaning or insights.

Visualization: The process of representing data visually to gain insights into its content and quality.
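
The definitions of mean, median, standard deviation, and pairwise comparison (correlation) above can be made concrete with a few lines of Python. This is a minimal sketch using only the standard library. The helper name `pearson_correlation` and the sample values are assumptions made for illustration; they do not come from the source document.

```python
import math
import statistics


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values, invented for illustration only.
ages = [50, 60, 70, 80, 90]
stay_days = [2, 4, 6, 8, 10]

age_mean = statistics.mean(ages)      # sum of the values / number of values
age_median = statistics.median(ages)  # middle value of the sorted data
age_stdev = statistics.pstdev(ages)   # population standard deviation
age_stay_corr = pearson_correlation(ages, stay_days)
```

Here `stay_days` grows in a straight line with `ages`, so the correlation coefficient comes out at 1.0, the strongest possible positive linear relationship. `statistics.pstdev` treats the list as the whole population; `statistics.stdev` would be the choice when the values are a sample.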

From Understanding to Preparation


Term Definition

Automation: Using tools and techniques to streamline data collection and preparation processes.

Data Collection: The phase of gathering and assembling data from various sources.

Data Compilation: The process of organizing and structuring data to create a comprehensive data set.

Data Formatting: The process of standardizing the data to ensure uniformity and ease of analysis.

Data Manipulation: The process of transforming data into a usable format.

Data Preparation: The phase where data is cleaned, transformed, and formatted for further analysis, including feature engineering and text analysis.

Data Preparation: The stage where data is transformed and organized to facilitate effective analysis and modeling.

Data Quality: Assessment of data integrity and completeness, addressing missing, invalid, or misleading values.

Data Quality Assessment: The evaluation of data integrity, accuracy, and completeness.

Data Set: A collection of data used for analysis and modeling.

Data Understanding: The stage in the data science methodology focused on exploring and analyzing the collected data to ensure that the data is representative of the problem to be solved.

Descriptive Statistics: Summary statistics that data scientists use to describe and understand the distribution of variables, such as mean, median, minimum, maximum, and standard deviation.

Feature: A characteristic or attribute within the data that helps in solving the problem.

Feature Engineering: The process of creating new features or variables based on domain knowledge to improve machine learning algorithms' performance.

Feature Extraction: Identifying and selecting relevant features or attributes from the data set.

Interactive Processes: Iterative and continuous refinement of the methodology based on insights and feedback from data analysis.

Missing Values: Values that are absent or unknown in the dataset, requiring careful handling during data preparation.

Model Calibration: Adjusting model parameters to improve accuracy and alignment with the initial design.

Pairwise Correlations: An analysis to determine the relationships and correlations between different variables.

Text Analysis: Steps to analyze and manipulate textual data, extracting meaningful information and patterns.

Text Analysis Groupings: Creating meaningful groupings and categories from textual data for analysis.

Visualization techniques: Methods and tools that data scientists use to create visual representations or graphics that enhance the accessibility and understanding of data patterns, relationships, and insights.
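
Several of the terms above (Data Quality, Missing Values, Feature Engineering) describe steps that are easier to follow in code. The sketch below shows one possible way to chain them in plain Python. The record layout, the 0 to 120 age range, the median imputation, and the `frequent_admit` flag are all assumptions chosen for illustration, not steps prescribed by the source.

```python
import statistics

# Hypothetical patient records, invented for illustration only.
raw_records = [
    {"id": 1, "age": 64, "admissions": 2},
    {"id": 2, "age": None, "admissions": 1},   # missing value
    {"id": 2, "age": None, "admissions": 1},   # duplicate of the row above
    {"id": 3, "age": 80, "admissions": 4},
    {"id": 4, "age": -5, "admissions": 0},     # invalid value
]


def prepare(records):
    """Remove duplicates, repair missing or invalid ages, add one feature."""
    # 1. Drop duplicate rows, keeping the first occurrence of each id.
    seen, unique = set(), []
    for rec in records:
        if rec["id"] not in seen:
            seen.add(rec["id"])
            unique.append(dict(rec))  # copy, so the input is not modified

    # 2. Treat out-of-range ages as missing, then fill every missing age
    #    with the median of the valid ages (one common choice among many).
    valid_ages = [r["age"] for r in unique
                  if r["age"] is not None and 0 <= r["age"] <= 120]
    fill = statistics.median(valid_ages)
    for rec in unique:
        if rec["age"] is None or not 0 <= rec["age"] <= 120:
            rec["age"] = fill

    # 3. Feature engineering: derive a new flag from an existing column.
    for rec in unique:
        rec["frequent_admit"] = rec["admissions"] >= 3
    return unique


clean = prepare(raw_records)
```

Whether to impute, drop, or flag a missing value depends on the problem and on domain knowledge; median imputation is used here only because it is short to demonstrate.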
From Modeling to Evaluation
Term Definition

Binary classification model: A model that classifies data into two categories, such as yes/no or stop/go outcomes.

Data compilation: The process of gathering and organizing data required for modeling.

Data modeling: The stage in the data science methodology where data scientists develop models, either descriptive or predictive, to answer specific questions.

Descriptive model: A type of model that examines relationships between variables and makes inferences based on observed patterns.

Diagnostic measure based tuning: The process of fine-tuning the model by adjusting parameters based on diagnostic measures and performance indicators.

Diagnostic measures: The evaluation of a model's performance to ensure that the model functions as intended.

Discrimination criterion: A measure used to evaluate the performance of the model in classifying different outcomes.

False-positive rate: The rate at which the model incorrectly identifies negative outcomes as positive.

Histogram: A graphical representation of the distribution of a dataset, where the data is divided into intervals or bins, and the height of each bar represents the frequency or count of data points falling within that interval.

Maximum separation: The point where the ROC curve provides the best discrimination between true-positive and false-positive rates, indicating the most effective model.

Model evaluation: The process of assessing the quality and relevance of the model before deployment.

Optimal model: The model that provides the maximum separation between the ROC curve and the baseline, indicating higher accuracy and effectiveness.

Receiver Operating Characteristic (ROC): A statistical curve, originally developed for military radar, used to assess the performance of binary classification models.

Relative misclassification cost: A parameter in model building used to tune the trade-off between true-positive and false-positive rates.

ROC curve (Receiver Operating Characteristic curve): A diagnostic tool used to determine the optimal classification model's performance.

Separation: The degree of discrimination achieved by the model in correctly classifying outcomes.

Statistical significance testing: Evaluation technique to verify that data is appropriately handled and interpreted within the model.

True-positive rate: The rate at which the model correctly identifies positive outcomes.
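
The true-positive rate, false-positive rate, ROC curve, and maximum separation defined above can be computed by hand on a small example. The following is a sketch under stated assumptions: the function names and the sample labels and scores are invented, and "maximum separation" is read here as the largest gap between true-positive and false-positive rate (the point farthest above the diagonal baseline), which is one reasonable interpretation rather than a definition given in the source.

```python
def rates_at_threshold(labels, scores, threshold):
    """Return (true-positive rate, false-positive rate) at one threshold.

    A case is predicted positive when its score is >= threshold.
    """
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


def roc_points(labels, scores):
    """One (threshold, TPR, FPR) triple per distinct score, high to low."""
    return [(t, *rates_at_threshold(labels, scores, t))
            for t in sorted(set(scores), reverse=True)]


def max_separation(points):
    """Point with the largest TPR - FPR, i.e. farthest above the baseline."""
    return max(points, key=lambda p: p[1] - p[2])


# Hypothetical labels (1 = positive) and model scores, for illustration only.
labels = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.1]

points = roc_points(labels, scores)
best = max_separation(points)
```

With this data, `best` is the threshold 0.7, where the model catches three of the four positives without any false positives. Lowering the threshold further raises the true-positive rate only at the price of a higher false-positive rate, which is the trade-off a relative misclassification cost is meant to tune.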

• Foundational methodology, a cyclical, iterative data science methodology developed by John Rollins, consists of 10 stages, starting with Business Understanding and ending with Feedback.

• CRISP-DM, an open source data methodology, combines several data-related methodology stages into one stage and omits the Feedback stage, resulting in a six-stage data methodology.

• The primary goal of the Business Understanding stage is to understand the business problem and determine the data needed to answer the core business question.

• During the Analytic Approach stage, you can choose from descriptive, diagnostic, predictive, and prescriptive analytic approaches and decide whether to use machine learning techniques.

• During the Data Requirements stage, scientists identify the correct and necessary data content, formats, and sources needed for the specific analytical approach.

• During the Data Collection stage, expert data scientists revise data requirements and make critical decisions regarding the quantity and quality of data. Data scientists apply descriptive statistics and visualization techniques to thoroughly assess the content, quality, and initial insights gained from the collected data, identify gaps, and determine if new data is needed or if they should substitute existing data.

• The Data Understanding stage encompasses all activities related to constructing the data set. This stage answers the question of whether the collected data represents the data needed to solve the business problem. Data scientists might use descriptive statistics, predictive statistics, or both.

• Data scientists commonly apply Hurst, univariates, and statistics such as mean, median, minimum, maximum, standard deviation, pairwise correlation, and histograms.

• During the Data Preparation stage, data scientists must address missing or invalid values, remove duplicates, and validate that the data is properly formatted. Feature engineering and text analysis are key techniques data scientists apply to validate and analyze data during the Data Preparation stage.

• The end goal of the Modeling stage is that the data model answers the business question. During the Modeling stage, data scientists use a training data set. Data scientists test multiple algorithms on the training set data to determine whether the variables are required and whether the data supports answering the business question. The outcome of those models is either descriptive or predictive.

• The Evaluation stage consists of two phases: the diagnostic measures phase and the statistical significance phase. Data scientists and others assess the quality of the model and determine if the model answers the initial Business Understanding question or if the data model needs adjustment.

• During the Deployment stage, data scientists release the data model to a targeted group of stakeholders, including solution owners, marketing staff, application developers, and IT administration.

• During the Feedback stage, stakeholders and users evaluate the model and contribute feedback to assess the model's performance.

• The data model's value depends on its ability to iterate; that is, how successfully the data model incorporates user feedback.
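
To illustrate the Modeling stage summarized above, together with the glossary terms training set, threshold value, and leaf, the sketch below fits the smallest possible decision tree: a single split on one numeric feature. It is a toy example under stated assumptions, not the method used in the course. The function names `fit_stump` and `predict` and the training values are hypothetical.

```python
def fit_stump(train_x, train_y):
    """Pick the threshold on one numeric feature that best splits the labels.

    Rule learned: predict 1 when x >= threshold, otherwise predict 0.
    Returns (threshold, training accuracy).
    """
    best_threshold, best_accuracy = None, -1.0
    # Try every distinct training value as a candidate threshold.
    for candidate in sorted(set(train_x)):
        correct = sum(1 for x, y in zip(train_x, train_y)
                      if (1 if x >= candidate else 0) == y)
        accuracy = correct / len(train_x)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = candidate, accuracy
    return best_threshold, best_accuracy


def predict(threshold, x):
    """Each side of the single split ends in a leaf: class 1 or class 0."""
    return 1 if x >= threshold else 0


# Hypothetical training set: input values with known (labeled) outcomes.
train_x = [1, 2, 3, 4, 6, 7, 8, 9]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]

threshold, train_accuracy = fit_stump(train_x, train_y)
```

On this data the chosen threshold is 6, which separates the two classes perfectly on the training set. Accuracy on the training set alone says little about new cases, which is why the Evaluation and Feedback stages follow Modeling.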
