
DATA SCIENCE

Unit 1
By SR, BP, UR (CSE-IITE)
Content
🞂 Introduction to Data Science

🞂 Data Science in Business

🞂 Use Cases for Data Science

🞂 Data Science and Big Data

🞂 Data Science and Machine Learning

🞂 Data Science Process Overview (DSLC)

🞂 Defining goals, Retrieving data, Data preparation, Data exploration, Data modeling, Presentation
Introduction to Data Science
🞂 Firstly, we need to know: “what is data?”
🞂 Data is a collection of individual facts or information.
🞂 Data is a collection of different numbers, symbols, and alphabets used to represent information.

🞂 Data is also defined as the quantities, characters, or symbols on which operations are performed by a computer, which may be stored and transmitted in the form of electrical signals and recorded on magnetic, optical, or mechanical recording media.
Introduction to Data Science

(Figures: where data comes from; types of data)

Data Science
🞂 Data Science is the study of extracting meaningful patterns and insights from raw data using a combination of statistics, programming, and domain knowledge.

🞂 It deals with structured (e.g., tables) and unstructured (e.g., images, text) data to solve real-world problems.

🞂 Data Science is an interdisciplinary field that combines techniques from statistics, computer science, and domain expertise to extract meaningful insights and knowledge from structured and unstructured data.

🞂 It focuses on gathering, cleaning, analyzing, and interpreting data to support decision-making and innovation.

🞂 Data Science is the generation of actionable knowledge directly from huge amounts of complex data.

🞂 The main goal of Data Science is to gain insights from any type of data.
Data Science in Business
🞂 What Do Data Science Professionals Do?

🞂 Data science professionals, often referred to as data scientists, perform a variety of tasks, including:

🞂 Identify Business Problems (Defining Problems):

🞂 Collaborate with stakeholders to define goals and requirements.

🞂 Data Acquisition (Data Collection):

🞂 Retrieve data from various sources such as databases, APIs, IoT devices, or web scraping.

🞂 Data Cleaning and Preparation:

🞂 Handle inconsistencies and missing values, and transform data.


Data Science in Business
🞂 What Do Data Science Professionals Do?

🞂 Exploratory Data Analysis (EDA):

🞂 Discover trends, correlations, and anomalies in data.

🞂 Model Development:

🞂 Build and fine-tune predictive or descriptive models using machine learning or statistical methods.

🞂 Communicate Results (Visualization and Communication):

🞂 Use dashboards, charts, and reports to present actionable insights.


Data Science in Business
🞂 What Do Data Scientists Do?

🞂 Collect and preprocess raw data.

🞂 Perform exploratory data analysis to identify trends.

🞂 Build and evaluate machine learning models.

🞂 Create visualizations and reports to communicate findings.

🞂 Work with business teams to align data strategies with goals.


Data Science in Business
🞂 For data science, we need to understand two things:

1. Storage

2. Computational speed (to build and train models efficiently)

🞂 Companies store data, analyze it, and find insights in it.

🞂 Example : a small retail store produces about 10 MB of data per day.

🞂 What can they do with this data?

🞂 Find :

🞂 Maximum profit

🞂 Which products to stock

🞂 Who is buying, etc.


Applications in Business
🞂 Data Science helps organizations:

🞂 Predict future trends.

🞂 Optimize operations.

🞂 Improve customer experience.

🞂 Enhance decision-making processes.

🞂 Examples :

🞂 Marketing: Customer segmentation, ad optimization.

🞂 E-commerce: Product recommendations based on user behavior.

🞂 Finance: Fraud detection using transaction patterns.

🞂 Healthcare: Disease prediction using patient history, improving diagnostics.

🞂 Retail: Inventory management, personalized recommendations.


Use Cases for Data Science
🞂 Predictive Analytics: Forecasting future events, trends and customer behaviors, e.g.,
demand forecasting.

🞂 Natural Language Processing (NLP): Sentiment analysis, chatbots, and translation tools.

🞂 Image Recognition: Autonomous vehicles, medical imaging.

🞂 Risk Analysis: Assessing creditworthiness or investment risks.

🞂 Healthcare: Disease prediction, drug discovery, and personalized treatments.

🞂 Finance: Fraud detection, credit scoring, and risk management.

🞂 Retail: Demand forecasting, inventory optimization, and customer segmentation.


Use Cases for Data Science
🞂 Technology: Recommendation systems, natural language processing, and
automation.

🞂 Recommendation Systems: Powering personalized suggestions in platforms like Netflix or Amazon.

🞂 Risk Management: Identifying and mitigating risks in finance and operations.

🞂 Image and Speech Recognition: Enhancing AI-driven tools such as virtual assistants.
Roles
🞂 Data Engineer:

🞂 Handle data
🞂 Ensure data quality and consistency across multiple sources
🞂 Work with data scientists to ensure the accuracy and consistency of the data used for analysis
🞂 Design and implement data pipelines to collect and process large amounts of data
🞂 Analyse and organise raw data
🞂 Manage and optimize data storage technologies such as Hadoop, NoSQL, and SQL databases
🞂 Stay up-to-date with the latest data storage technologies and best practices
Roles
🞂 Data Analyst:
🞂 Analyze the data
🞂 Extract relevant business data from primary and secondary sources
🞂 Develop and maintain databases and data systems
🞂 Identify patterns and trends in data to drive business decisions
🞂 Data Scientist:
🞂 Make predictions from the data
🞂 Build predictive models
🞂 Use machine learning tools to create and optimise data classifiers
🞂 Collect and clean large data sets
🞂 Stay up-to-date with the latest data science techniques and technologies
🞂 Collaborate with business and information technology (IT) teams to optimise processes using derived insights
Roles
🞂 Difference between Data Scientist, Data Analyst, and Data Engineer
Roles - Skill Sets
🞂 The different skill sets required for a Data Analyst, Data Engineer, and Data Scientist:
Roles and Responsibilities

🞂 The roles and responsibilities of a data analyst, data engineer, and data scientist overlap in much the same way as their skill sets do.
Data Science and Related Fields
🞂 Big Data:

🞂 Tools like Hadoop and Spark handle large-scale data processing.

🞂 Machine Learning:

🞂 A key toolset within data science for predictive modeling.


Data Science and Big Data
🞂 Big Data is a massive collection of data that continues to grow dramatically over time.

🞂 Big Data is like regular data, but much larger.

🞂 It is data that is very large in size.

🞂 Normally we work with data of megabyte size (Word documents, Excel files) or at most gigabytes (movies, code), but data at petabyte scale (1 PB = 10^15 bytes) is called Big Data.

🞂 Data science techniques and tools (like Hadoop, Spark, and cloud platforms) enable handling and analyzing Big Data to derive actionable insights.

🞂 In other words, Data Science utilizes Big Data technologies like Hadoop, Spark, and cloud storage to manage and analyze these datasets.
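
🞂 As a minimal sketch of what using Spark looks like in practice (the events.csv file and the event_type column are hypothetical placeholders, and pyspark is assumed to be installed), PySpark can aggregate a large dataset in a few lines:

from pyspark.sql import SparkSession

# Start a local Spark session (a sketch; cluster configuration omitted).
spark = SparkSession.builder.appName("BigDataSketch").getOrCreate()

# "events.csv" and the "event_type" column are hypothetical placeholders.
df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.groupBy("event_type").count().show()   # distributed aggregation over the dataset

spark.stop()

🞂 The same code scales from a laptop to a cluster because Spark distributes the aggregation across its workers.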
Data Science and Big Data
Sources of Big Data
🞂 Posts, photos, videos, likes, and comments on social media
🞂 Traffic data & GPS signals
🞂 Software logs, camera and microphone feeds
🞂 Emails, blogs, and e-news
🞂 Digital pictures & videos
🞂 Huge volumes of data from weather stations and satellites, stored and processed for forecasting
Big Data Characteristics (7 V’s)
🞂 Big Data refers to massive datasets characterized by:
Big Data Characteristics (7 V’s)
🞂 Volume :
🞂 Volume represents the amount of data, which is growing at a high rate.
🞂 The size of the data,
🞂 i.e. data volume in petabytes.
🞂 Value :
🞂 Value refers to turning data into value.
🞂 By turning accessed big data into value, businesses may generate revenue.
🞂 Veracity :
🞂 Veracity refers to the uncertainty of available data.
🞂 Veracity arises because the high volume of data brings incompleteness and inconsistency.
🞂 It refers to the trustworthiness of the data, i.e. its accuracy and quality.
Big Data Characteristics (7 V’s)
🞂 Visualization :

🞂 Visualization is the process of displaying data in charts, graphs, maps, and other
visual forms.

🞂 Variety :

🞂 Variety refers to the different data types

🞂 i.e. various data formats like text, audio, video, etc.

🞂 Velocity :

🞂 Velocity is the rate at which data grows.

🞂 Social media plays a major role in the velocity of growing data.

🞂 The frequency of incoming data that needs to be processed.

🞂 The speed at which data is processed.

🞂 The speed of data generation.


Big Data Characteristics (7 V’s)
🞂 Virality :

🞂 Virality describes how quickly information spreads across person-to-person (P2P) networks.
How is Big Data generated?
🞂 It can be generated by machines as well as by humans.

🞂 Machine-Generated Data :

🞂 e.g. sensors, industrial machinery, or vehicles

🞂 e.g. a submarine's radio antenna and radar

🞂 Human-Generated Data :

🞂 e.g. activity on social media

Types of Big Data
1. Unstructured

2. Semi-structured

3. Structured
Unstructured
🞂 Any data with an unknown form or structure is classified as unstructured data.

🞂 Due to its vast size, unstructured data presents significant challenges in processing
and extracting valuable insights.

🞂 A common example of unstructured data is a heterogeneous data source that combines various formats, such as text files, images, and videos, like those generated by search engines such as Google.

🞂 Although organizations today possess an abundance of data, they often struggle to derive meaningful value from it because the data exists in its raw, unstructured state.

(Figures: examples of human-generated and machine-generated unstructured data)


Structured
🞂 Any data that can be stored, accessed, and processed in a fixed format is termed "structured" data.

🞂 Data stored in a relational database management system is one example of structured data.

🞂 Student_Table :

Student_ID  Student_Name  Gender  Department  Division  Percentage
1           XYX           MALE    CSE         A         90.00
2           ABC           MALE    CSE         C         79.00
3           PQR           FEMALE  CSE         E         82.00
4           MNR           FEMALE  CSE         B         86.00
Semi-structured
🞂 Semi-structured data is the third type of big data.

🞂 Semi-structured data can contain both forms of data.

🞂 That is, it contains both of the formats mentioned above: structured and unstructured data.

🞂 Users may see semi-structured data as structured in form, but it is not actually defined by, e.g., a table definition in a relational DBMS.

🞂 Personal data stored in an XML file:


<rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>

<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>

<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
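
🞂 As a minimal sketch, the records above can be processed programmatically; assuming they are wrapped in a single root element so the XML is well-formed, Python's built-in xml.etree.ElementTree can parse them:

import xml.etree.ElementTree as ET

# The records above, wrapped in a root element so the XML is well-formed.
xml_data = """<people>
  <rec><name>Prashant Rao</name><sex>Male</sex><age>35</age></rec>
  <rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
  <rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
</people>"""

root = ET.fromstring(xml_data)           # parse the XML string
for rec in root.findall("rec"):          # every <rec> child of the root
    name = rec.find("name").text
    age = int(rec.find("age").text)      # tag content is text; convert to int
    print(name, age)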
Big Data Types

(Figure: the three types of big data - structured, unstructured, and semi-structured)


Differences of data

🞂 Flexibility:
🞂 Structured data: schema-dependent and less flexible.
🞂 Semi-structured data: more flexible than structured data but less flexible than unstructured data.
🞂 Unstructured data: flexible in nature; there is an absence of a schema.

🞂 Transaction Management:
🞂 Structured data: matured transactions and various concurrency techniques.
🞂 Semi-structured data: transactions adapted from the DBMS, not matured.
🞂 Unstructured data: no transaction management and no concurrency.

🞂 Query Performance:
🞂 Structured data: structured queries allow complex joins.
🞂 Semi-structured data: queries over anonymous nodes are possible.
🞂 Unstructured data: only textual queries are possible.

🞂 Technology:
🞂 Structured data: based on relational database tables.
🞂 Semi-structured data: based on RDF and XML.
🞂 Unstructured data: based on character and binary data.
Data Science and Machine Learning
🞂 Machine Learning is a core component of Data Science and involves creating
algorithms that improve with data over time.

🞂 Types of machine learning include:

🞂 Supervised Learning:
🞂 Prediction based on labeled data (e.g., house price prediction).

🞂 Unsupervised Learning:
🞂 Discovering hidden patterns in unlabeled data (e.g., customer segmentation).

🞂 Reinforcement Learning:
🞂 Learning through feedback in dynamic environments (e.g., game-playing AI).
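
🞂 A minimal supervised-learning sketch with scikit-learn, using made-up house sizes and prices (all numbers are illustrative assumptions, not real data):

from sklearn.linear_model import LinearRegression

# Labeled training data: house size in square feet -> price.
# All numbers are illustrative assumptions, not real data.
X = [[800], [1000], [1200], [1500], [1800]]       # feature: size
y = [96000, 120000, 144000, 180000, 216000]       # label: price

model = LinearRegression()
model.fit(X, y)                                   # learn size -> price from labels
print(model.predict([[1300]]))                    # predict for an unseen house

🞂 The model learns the mapping from labeled examples, which is exactly what distinguishes supervised learning from the other two types.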
Data Science Process Overview (DSLC)
🞂 Defining Goals :

🞂 Understand the business problem or define research questions.

🞂 Understand the problem or business objective.

🞂 Set measurable and specific goals.

🞂 Retrieving Data :

🞂 Gather relevant data from sources.

🞂 From Various Sources: Databases, APIs, web scraping, or data warehouses.

🞂 Tools: SQL, Python, R (see the retrieval sketch after this list).

🞂 Data Preparation:

🞂 Cleaning: Handle missing or duplicate data and fill in missing values.

🞂 Transformation: Normalize or scale data and handle outliers.

🞂 Tools: Pandas, NumPy, ETL tools.
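
🞂 A small illustration of the Retrieving Data step above (a sketch only: the sales.db file and the transactions table are hypothetical), using Python's built-in sqlite3 module to pull data with SQL:

import sqlite3

# "sales.db" and the "transactions" table are hypothetical placeholders.
conn = sqlite3.connect("sales.db")
cur = conn.cursor()
cur.execute("SELECT customer_id, amount FROM transactions LIMIT 5")
for row in cur.fetchall():        # each row comes back as a tuple
    print(row)
conn.close()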


Data Science Process Overview (DSLC)
🞂 Data Exploration :

🞂 Use statistical methods and visualizations to understand patterns and relationships in the data.

🞂 Visualization: Use charts to uncover trends.

🞂 Statistical Analysis: Identify relationships and distributions.

🞂 Tools: Matplotlib, Seaborn, Tableau.

🞂 Data Modeling

🞂 Apply algorithms (e.g., regression, classification, or clustering).

🞂 Train and validate models to ensure they generalize well.

🞂 Evaluate model performance using metrics like accuracy or RMSE.

🞂 Tools: Scikit-learn, TensorFlow, PyTorch.


Data Science Process Overview (DSLC)
🞂 Presentation :

🞂 Create dashboards, charts, or written reports for stakeholders.

🞂 Dashboards: Summarize findings interactively.

🞂 Reports: Present actionable insights clearly.

🞂 Tools: Power BI, Tableau, Jupyter Notebooks.


Data Science Life Cycle (DSLC)

(Figure: the stages of the data science life cycle)

Data Science Life Cycle (DSLC)
🞂 The data science life cycle is a systematic approach to managing a data science
project. It consists of several stages, each with its own set of tasks and objectives.

🞂 Here are the typical steps involved in the data science life cycle:

🞂 Define the problem

🞂 This is the first and most critical step in the data science life cycle. It involves clearly
defining the problem or question that the data science project aims to address. For
example, a company may want to predict customer churn based on their purchasing
behavior. During the “define the problem” phase of the data science life cycle, some of
the questions that you may ask include:
Data Science Life Cycle (DSLC)
1. What is the business problem that needs to be solved?
2. What are the goals and objectives of the project?
3. What data is available to solve the problem?
4. What are the constraints or limitations of the project?
5. Who are the stakeholders involved in the project?
6. What are the assumptions made while defining the problem?
7. What is the scope of the project?
8. What is the timeline for completing the project?
9. What are the risks associated with the project?
10. What are the ethical considerations that need to be taken into account during the project?
🞂 The “define the problem” phase in the data science life cycle is typically more of
a conceptual phase and does not involve the use of any specific tools or software.
Data Science Life Cycle (DSLC)
🞂 Data collection

🞂 Once the problem is defined, the next step is to collect relevant data. This can involve
gathering data from internal or external sources, or acquiring data through web
scraping or other methods.

🞂 For example, a company may collect data on customer transactions and demographics to predict customer churn.

🞂 During the “data collection” phase of the data science life cycle, some of the questions
that you may ask include:
Data Science Life Cycle (DSLC)
1. What data sources are available and what is the format of the data?
2. What is the volume of data available and how frequently is it collected?
3. How was the data collected and what is the quality of the data?
4. What are the missing or incomplete data points, and how will they be handled?
5. What are the ethical considerations that need to be taken into account while collecting
the data?
6. What are the legal and regulatory requirements for collecting the data?
7. What are the limitations and biases associated with the data?
8. What data preprocessing and cleaning steps are needed to ensure that the data is
suitable for analysis?
9. How will the data be stored and managed during the project?
10.How will the data be accessed and shared with other team members involved in the
project?
Data Science Life Cycle (DSLC)
🞂 Tools used for Data Collection :

🞂 The “data collection” phase of the data science life cycle involves the process of
collecting, acquiring, and gathering data from various sources. The tools used for this
phase may vary depending on the data sources, the type of data, and the scale of the
project. Here are some examples of tools commonly used for data collection:

1. Web scraping tools such as Beautiful Soup and Scrapy for collecting data from
websites.

2. Survey tools such as SurveyMonkey and Google Forms for collecting survey data.

3. Data collection and management platforms such as Qualtrics and Amazon Mechanical Turk.

4. Database management systems such as MySQL and PostgreSQL for collecting and
storing structured data.
Data Science Life Cycle (DSLC)
5. Big data platforms such as Hadoop and Apache Spark for processing large volumes of unstructured data.

6. APIs for accessing data from social media platforms such as Twitter and Facebook.

7. Sensors and IoT devices for collecting data in real time.

8. File formats such as CSV, Excel, JSON, and XML for storing and transferring data.

🞂 It’s important to choose the appropriate tools for data collection that best fit the project
requirements and objectives. Additionally, it’s crucial to ensure that the data collection
process is ethical, legal, and secure.
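
🞂 A minimal web-scraping sketch with requests and Beautiful Soup (the URL, and the assumption that product names sit in <h2> tags, are placeholders; real scraping must respect a site's terms of service and robots.txt):

import requests
from bs4 import BeautifulSoup

# Placeholder URL; replace with a site you are permitted to scrape.
url = "https://example.com/products"
html = requests.get(url, timeout=10).text      # download the page
soup = BeautifulSoup(html, "html.parser")      # parse the HTML

# Hypothetical assumption: product names appear in <h2> tags.
for h2 in soup.find_all("h2"):
    print(h2.get_text(strip=True))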
Data Science Life Cycle (DSLC)
🞂 Data preparation:

🞂 Raw data is often messy and requires cleaning and preprocessing to prepare it for
analysis.

🞂 This step involves tasks such as removing duplicates, filling missing values, and
transforming data types.

🞂 For example, in the customer churn project, data preparation may involve removing
invalid records or imputing missing values.

🞂 During the “data preparation” phase of the data science life cycle, the main goal is to
clean and transform the raw data into a usable format for analysis.

🞂 Here are some examples of questions that may be covered during this phase:
Data Science Life Cycle (DSLC)
1. Is the data complete? Are there missing values or outliers that need to be
addressed?
2. Are there any inconsistencies or errors in the data that need to be corrected?
3. Do the variables need to be transformed or scaled in order to fit the model
assumptions?
4. Do any categorical variables need to be converted into numerical values or one-hot encoded?
5. Are there any duplicated or redundant observations that need to be removed?
6. Do we need to merge multiple data sources together to create a unified dataset?
7. Are there any additional features that need to be created from the existing data?
8. Does the data need to be sampled or aggregated to reduce its size or
complexity?
Data Science Life Cycle (DSLC)
🞂 Tools used for Data Preparation:

🞂 The goal of the data preparation phase is to ensure that the data is in a suitable format
for analysis and modeling.

🞂 This phase is critical because the quality of the data used for analysis directly impacts
the accuracy and reliability of the results.

🞂 There are several tools and software that can be used for the “data preparation”
phase in the data science life cycle.

🞂 Here are some commonly used tools:

1. OpenRefine: OpenRefine is a free and open-source tool for cleaning and transforming
messy data. It allows you to explore and transform large datasets quickly and easily.
Data Science Life Cycle (DSLC)
🞂 Tools used for Data Preparation:

2. Python libraries: There are several Python libraries that are commonly used for data preparation, including Pandas, NumPy, and SciPy. These libraries provide a wide range of functions for data cleaning, transformation, and manipulation.

3. R programming: R is a popular programming language for data analysis, and it has several libraries and packages that can be used for data preparation.

4. Excel: Excel is a widely used spreadsheet software that can be used for simple data preparation tasks such as filtering, sorting, and data cleaning.

5. SQL: SQL is a database management language that can be used to extract and manipulate data from databases.

🞂 The choice of tool will depend on the specific requirements of the project, the size of
the dataset, and the skill set of the data scientist.
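
🞂 A minimal Pandas sketch of common preparation steps, echoing the customer-churn example (churn.csv and its columns are hypothetical):

import pandas as pd

# "churn.csv" and its columns are hypothetical, echoing the churn example above.
df = pd.read_csv("churn.csv")

df = df.drop_duplicates()                               # remove duplicated records
df["age"] = df["age"].fillna(df["age"].median())        # impute missing values
df["signup_date"] = pd.to_datetime(df["signup_date"])   # transform the data type
df = df[df["monthly_charge"] >= 0]                      # drop invalid records

df.info()   # confirm types and non-null counts after cleaning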
Data Science Life Cycle (DSLC)
🞂 Data exploration:

🞂 Once the data is cleaned and preprocessed, the next step is to explore the data to
gain insights and identify patterns.

🞂 This can involve tasks such as data visualization, statistical analysis, and hypothesis
testing.

🞂 For example, in the customer churn project, data exploration may involve creating
visualizations of customer purchasing behavior to identify trends.

🞂 The “data exploration” phase in the data science life cycle involves analyzing and
visualizing the data to gain insights and understanding.

🞂 Some of the questions that may be covered during this phase include:
Data Science Life Cycle (DSLC)
🞂 Data exploration:

1. What are the key features and characteristics of the dataset?

2. Are there any patterns or trends in the data?

3. Are there any outliers or anomalies in the data?

4. What is the distribution of the data?

5. Are there any correlations between different variables in the data?

6. What are the most important variables in the data?

7. Are there any missing values or inconsistencies in the data?

8. What is the size and complexity of the dataset?


Data Science Life Cycle (DSLC)
🞂 Data exploration:

🞂 The answers to these questions will help data scientists to understand the structure
and content of the data, identify any issues or challenges, and develop strategies for
further analysis and modeling. The data exploration phase is an important step in the
data science life cycle, as it lays the foundation for subsequent phases such as data
modeling and evaluation.
Data Science Life Cycle (DSLC)
🞂 Tools used for Data Exploration :

🞂 The “data exploration” phase in the data science life cycle involves analyzing and
visualizing the data to gain insights and understanding.

🞂 There are many tools and software that data scientists can use for this phase,
depending on their specific needs and preferences.

🞂 Some of the commonly used tools for data exploration include:

1. Python libraries such as Pandas, NumPy, and Matplotlib

2. R programming language and packages such as ggplot2 and dplyr

3. Tableau and Power BI for creating interactive visualizations


Data Science Life Cycle (DSLC)
🞂 Tools used for Data Exploration :

4. Excel for basic data analysis and visualization

5. Jupyter Notebook for creating and sharing data analysis workflows

6. RapidMiner for data mining and predictive analytics

7. IBM Watson Studio and Google Colab for cloud-based data analysis and collaboration

🞂 These tools provide various functionalities for data exploration, including data cleaning, transformation, visualization, and statistical analysis.

🞂 They allow data scientists to interact with the data and explore it in different ways, enabling them to identify patterns, relationships, and insights that can inform further analysis and modeling.
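
🞂 A minimal exploration sketch with Pandas and Matplotlib, reusing the hypothetical churn.csv dataset (the numeric_only flag assumes a recent Pandas version):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("churn.csv")                  # hypothetical dataset from earlier

print(df.describe())                           # distributions: mean, std, quartiles
print(df.corr(numeric_only=True))              # correlations between numeric columns

df["monthly_charge"].hist(bins=30)             # hypothetical column; one distribution
plt.xlabel("Monthly charge")
plt.ylabel("Number of customers")
plt.show()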
Data Science Life Cycle (DSLC)
🞂 Model building:

🞂 Based on the insights gained from data exploration, the next step is to build a
predictive model that can be used to solve the problem.

🞂 This can involve selecting an appropriate algorithm, training the model on the data,
and tuning the model parameters.

🞂 For example, in the customer churn project, model building may involve training a
logistic regression model to predict the likelihood of a customer churning.

🞂 The “model building” phase in the data science life cycle involves selecting and
developing appropriate models to analyze the data and make predictions or decisions.
Data Science Life Cycle (DSLC)
🞂 During this phase, data scientists may ask the following types of questions:

1. What type of model is appropriate for the problem we are trying to solve?

2. What features or variables should be included in the model?

3. How do we ensure the model is accurate and reliable?

4. What algorithms or techniques should we use to develop the model?

5. How do we evaluate the performance of the model?

6. How can we optimize the model for better accuracy or efficiency?


Data Science Life Cycle (DSLC)
🞂 The specific questions asked during this phase will depend on the particular problem,
data, and modeling techniques being used. The goal of the model building phase is to
create a model that accurately represents the data and can be used to make
predictions or decisions with confidence.
Data Science Life Cycle (DSLC)
🞂 Tools used for Model Building :

🞂 There are several tools and libraries available for model building in data science,
depending on the specific requirements of the project.

🞂 Some commonly used tools and libraries include:

1. Python: Python is a popular programming language for data science and offers a
variety of libraries for model building, such as scikit-learn, TensorFlow, and Keras.

2. R: R is another popular programming language for data science and offers a variety of
packages for model building, such as caret, randomForest, and xgboost.

3. MATLAB: MATLAB is a numerical computing environment that offers a variety of tools and functions for model building.
Data Science Life Cycle (DSLC)
🞂 Tools used for Model Building :

4. RapidMiner: RapidMiner is an open-source data science platform that offers a variety of tools and functions for model building, including data preprocessing, visualization, and machine learning.

5. KNIME: KNIME is an open-source data science platform that offers a variety of tools and functions for model building, including data preprocessing, visualization, and machine learning.

6. SAS: SAS is a proprietary software suite that offers a variety of tools and functions for model building, including data preprocessing, visualization, and machine learning.
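
🞂 A minimal scikit-learn sketch of the model-building step for the churn example from earlier slides (the dataset, feature columns, and label are assumptions for illustration):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("churn.csv")                          # hypothetical dataset
X = df[["age", "monthly_charge", "tenure_months"]]     # hypothetical features
y = df["churned"]                                      # hypothetical 0/1 label

# Hold out 20% of the data for later evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)                            # train on the labeled data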
Data Science Life Cycle (DSLC)
🞂 Model evaluation:

🞂 Once the model is built, it needs to be evaluated to determine its effectiveness.

🞂 This can involve tasks such as cross-validation, testing the model on new data, and
evaluating metrics such as accuracy, precision, and recall.

🞂 For example, in the customer churn project, model evaluation may involve testing the
logistic regression model on a holdout dataset to determine its accuracy.

🞂 During the “model evaluation” phase of the data science life cycle, data scientists
typically ask questions that help them assess the quality and effectiveness of the
models they have built.
Data Science Life Cycle (DSLC)
🞂 Model evaluation:

🞂 Some of the questions that may be covered in this phase include:

1. How well does the model fit the data?

2. What is the accuracy of the model?

3. Are there any biases or errors in the model?

4. How does the model perform on new or unseen data?

5. Are there any improvements or adjustments that can be made to the model?

🞂 The goal of the model evaluation phase is to ensure that the model is robust, accurate,
and effective in solving the problem it was designed to address. This phase helps data
scientists determine whether the model is ready for deployment and use in real-world
applications.
Data Science Life Cycle (DSLC)
🞂 Tools used for Model Evaluation:

🞂 There are many tools that can be used for model evaluation in data science. Some of
the commonly used ones are:

1. Scikit-learn: This is a popular machine learning library in Python that provides a wide
range of algorithms and evaluation metrics for model evaluation.

2. TensorFlow: This is an open-source library for machine learning developed by Google. It provides tools for building and training machine learning models and also has evaluation metrics for model evaluation.

3. Keras: This is a high-level neural networks library that can run on top of TensorFlow. It
provides evaluation metrics for model evaluation.
Data Science Life Cycle (DSLC)
🞂 Tools used for Model Evaluation:

4. R: This is a programming language commonly used for statistical computing and graphics. It provides a wide range of packages and functions for model evaluation.

5. Excel: This is a spreadsheet software that can be used for basic statistical analysis and model evaluation.

6. Tableau: This is a data visualization tool that can be used to visualize model results and evaluate model performance.
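
🞂 A minimal, self-contained evaluation sketch with scikit-learn, continuing the hypothetical churn example: it measures accuracy, precision, and recall on a holdout set and adds 5-fold cross-validation:

import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score

df = pd.read_csv("churn.csv")                          # hypothetical dataset
X = df[["age", "monthly_charge", "tenure_months"]]     # hypothetical features
y = df["churned"]                                      # hypothetical 0/1 label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

y_pred = model.predict(X_test)                         # test on held-out data
print("Accuracy: ", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:   ", recall_score(y_test, y_pred))

# 5-fold cross-validation for a more robust performance estimate.
print("CV accuracy:", cross_val_score(model, X, y, cv=5).mean())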
Data Science Life Cycle (DSLC)
🞂 Model deployment:

🞂 The final step in the data science life cycle is to deploy the model into a production
environment where it can be used to solve the problem.

🞂 This can involve integrating the model with other systems, creating a user interface,
and monitoring the model performance over time.

🞂 For example, in the customer churn project, model deployment may involve integrating
the logistic regression model into a customer relationship management (CRM) system
to identify customers at risk of churning.

🞂 During the “model deployment” phase of the data science life cycle, data scientists
typically ask questions that help them ensure that the model is implemented and used
effectively in real-world scenarios.
Data Science Life Cycle (DSLC)
🞂 Model deployment:

🞂 Some of the questions that may be covered in this phase include:

1. How will the model be integrated into the existing system or workflow?

2. What resources are required to support the model in a production environment?

3. How will the model be monitored and maintained over time?

4. What are the potential risks or challenges associated with deploying the model?

5. How will the performance of the model be measured and evaluated once it is in use?
Data Science Life Cycle (DSLC)

🞂 The goal of the model deployment phase is to ensure that the model is implemented smoothly and effectively, and that it continues to deliver value and solve the problem it was designed to address over time. This phase involves collaboration with various stakeholders, including IT teams, end-users, and management, to ensure that the model is integrated and used effectively within the organization.
Data Science Life Cycle (DSLC)
🞂 Tools used for Model Deployment:
🞂 The choice of tool for model deployment in data science depends on the specific
requirements of the project and the infrastructure available. However, some common
tools used for model deployment in data science include:
1. Docker: Docker is an open-source platform that allows developers to package and
deploy applications in containers. It is often used for deploying machine learning
models in a portable and scalable way.
2. Kubernetes: Kubernetes is an open-source platform for automating deployment,
scaling, and management of containerized applications. It is often used for deploying
machine learning models in production environments.
Data Science Life Cycle (DSLC)
🞂 Tools used for Model Deployment:

3. TensorFlow Serving: TensorFlow Serving is an open-source software library for serving machine learning models. It is often used for deploying TensorFlow models in production environments.

4. Flask/Django: Flask and Django are popular web frameworks for building web applications. They can be used to build RESTful APIs for serving machine learning models.
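
🞂 A minimal Flask sketch of serving a trained model behind a REST endpoint (model.pkl and the three-feature input are assumptions; a production deployment would add input validation, logging, and a proper WSGI server):

import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

# Hypothetical: a model trained earlier and saved with pickle as "model.pkl".
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json()                        # e.g. {"features": [34, 59.9, 12]}
    prediction = model.predict([data["features"]])   # score one row of features
    return jsonify({"churn": int(prediction[0])})

if __name__ == "__main__":
    app.run(port=5000)                               # development server only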

🞂 These are the typical steps involved in the data science life cycle. The exact details of
each step may vary depending on the project and the specific needs of the
organization.
THANK YOU
