Data Science Unit 1
By SR, BP, UR (CSE-IITE)
Content
🞂 Introduction to Data Science
🞂 Data Science deals with structured (e.g., tables) and unstructured (e.g., images, text)
data to solve real-world problems.
🞂 The main goal of Data Science is to gain insights from any type of data.
Data Science in Business
🞂 What Do Data Science Professionals Do?
🞂 Retrieve data from various sources such as databases, APIs, IoT devices, or web
scraping
🞂 Model Development:
1. Storage
🞂 Companies store the data, analyze it, and derive insights from it.
🞂 Find:
🞂 Maximum Profit
🞂 Product to stock
🞂 Optimize operations.
🞂 Examples:
🞂 Handle Data
🞂 Ensuring data quality and consistency across multiple sources
🞂 Working with data scientists to ensure the accuracy and consistency of the data
used for analysis
🞂 Designing and implementing data pipelines to collect and process large amounts of
data
🞂 Analyse and organise raw data
🞂 Managing and optimizing data storage technologies such as Hadoop, NoSQL, and
SQL databases
🞂 Staying up-to-date with the latest data storage technologies and best practices
Roles
🞂 Data Analyst:
🞂 Analyze the data
🞂 Extract relevant business data from primary and secondary sources
🞂 Develop and maintain databases and data systems
🞂 Identifying patterns and trends in data to drive business decisions
🞂 Data Scientists:
🞂 Make predictions from data
🞂 Build predictive models
🞂 Use machine learning tools to create and optimise data classifiers
🞂 Collecting and cleaning large data sets
🞂 Staying up-to-date with the latest data science techniques and technologies
🞂 Collaborate with business and information technology (IT) teams to optimise
processes using derived insights
Roles
🞂 Difference between Data Scientist, Data Analyst, and Data Engineer
Roles-Skill Sets
🞂 The different skill sets required for Data Analyst, Data Engineer and Data Scientist:
Roles and Responsibilities
🞂 The roles and responsibilities of a data analyst, data engineer, and data scientist
overlap considerably, much as their skill sets do.
Data Science and Related Fields
🞂 Big Data:
🞂 Machine Learning:
🞂 Normally we work with data on the scale of MB (Word docs, Excel files) or at most GB
(movies, code), but data at the petabyte scale, i.e. 10¹⁵ bytes, is called Big Data.
🞂 Data science techniques and tools (like Hadoop, Spark, and cloud platforms) enable
handling and analyzing Big Data to derive actionable insights.
🞂 In other words, Data Science utilizes Big Data technologies like Hadoop,
Spark, and cloud storage to manage and analyze these datasets.
Data Science and Big Data
Sources of Big Data
🞂 Posts, photos, videos, likes, and comments on social media
🞂 Traffic data & GPS signals
🞂 Visualization is the process of displaying data in charts, graphs, maps, and other
visual forms.
🞂 Variety:
🞂 Velocity:
🞂 Virality describes how quickly information spreads across person-to-person
(P2P) networks.
How Is Big Data Generated?
🞂 It can be generated by machines as well as by humans.
1. Unstructured
2. Semi-structured
3. Structured
Unstructured
🞂 Any data with unknown form or the structure is classified as unstructured data.
🞂 Due to its vast size, unstructured data presents significant challenges in processing
and extracting valuable insights.
🞂 Student_Table:
🞂 Semi-structured data pertains to the data containing both the formats mentioned
above, that is, structured and unstructured data.
🞂 Semi-structured data looks structured in form, but it is not defined by a fixed schema
such as a table definition in a relational DBMS.
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
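The two `<rec>` records above can be read with Python's standard-library XML parser. This is a minimal sketch; the `<recs>` wrapper element is our own addition so the fragment forms well-formed XML, not part of the example:

```python
import xml.etree.ElementTree as ET

# The two <rec> records from the slide, wrapped in a root element
# (the <recs> wrapper is an assumption, not part of the original data).
xml_data = """<recs>
<rec><name>Seema R.</name><sex>Female</sex><age>41</age></rec>
<rec><name>Satish Mane</name><sex>Male</sex><age>29</age></rec>
</recs>"""

root = ET.fromstring(xml_data)
people = [
    {"name": r.findtext("name"),
     "sex": r.findtext("sex"),
     "age": int(r.findtext("age"))}
    for r in root.findall("rec")
]
print(people)
```

The tags impose some structure, but nothing enforces a schema, which is exactly what makes the data semi-structured rather than structured.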
Big Data Type
Structured Data
Unstructured Data
🞂 Supervised Learning:
🞂 Prediction based on labeled data (e.g., house price prediction).
🞂 Unsupervised Learning:
🞂 Discovering hidden patterns in unlabeled data (e.g., customer segmentation).
🞂 Reinforcement Learning:
🞂 Learning through feedback in dynamic environments (e.g., game-playing AI).
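As a toy illustration of supervised learning, here is a minimal 1-nearest-neighbour "house price" predictor; all the numbers are invented for the sketch:

```python
# Toy supervised learning: 1-nearest-neighbour house-price prediction.
# Labeled training data: (area in sq. ft., price); values are made up.
train = [(500, 50_000), (1000, 100_000), (1500, 150_000)]

def predict_price(area):
    # Pick the label of the training example whose area is closest.
    nearest = min(train, key=lambda pair: abs(pair[0] - area))
    return nearest[1]

print(predict_price(1100))  # closest training area is 1000 → 100000
```

The model "learns" only by memorising labeled examples, but it shows the core supervised idea: known input-output pairs drive predictions for new inputs.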
Data Science Process Overview (DSLC)
🞂 Defining Goals:
🞂 Retrieving Data:
🞂 Data Preparation:
🞂 Data Modeling:
🞂 Here are the typical steps involved in the data science life cycle:
🞂 This is the first and most critical step in the data science life cycle. It involves clearly
defining the problem or question that the data science project aims to address. For
example, a company may want to predict customer churn based on their purchasing
behavior. During the “define the problem” phase of the data science life cycle, some of
the questions that you may ask include:
Data Science Life Cycle (DSLC)
1. What is the business problem that needs to be solved?
2. What are the goals and objectives of the project?
3. What data is available to solve the problem?
4. What are the constraints or limitations of the project?
5. Who are the stakeholders involved in the project?
6. What are the assumptions made while defining the problem?
7. What is the scope of the project?
8. What is the timeline for completing the project?
9. What are the risks associated with the project?
10. What are the ethical considerations that need to be taken into account during the project?
🞂 The “define the problem” phase in the data science life cycle is typically more of
a conceptual phase and does not involve the use of any specific tools or software.
Data Science Life Cycle (DSLC)
🞂 Data collection
🞂 Once the problem is defined, the next step is to collect relevant data. This can involve
gathering data from internal or external sources, or acquiring data through web
scraping or other methods.
🞂 During the “data collection” phase of the data science life cycle, some of the questions
that you may ask include:
Data Science Life Cycle (DSLC)
1. What data sources are available and what is the format of the data?
2. What is the volume of data available and how frequently is it collected?
3. How was the data collected and what is the quality of the data?
4. What are the missing or incomplete data points, and how will they be handled?
5. What are the ethical considerations that need to be taken into account while collecting
the data?
6. What are the legal and regulatory requirements for collecting the data?
7. What are the limitations and biases associated with the data?
8. What data preprocessing and cleaning steps are needed to ensure that the data is
suitable for analysis?
9. How will the data be stored and managed during the project?
10. How will the data be accessed and shared with other team members involved in the
project?
Data Science Life Cycle (DSLC)
🞂 Tools used for Data Collection:
🞂 The “data collection” phase of the data science life cycle involves the process of
collecting, acquiring, and gathering data from various sources. The tools used for this
phase may vary depending on the data sources, the type of data, and the scale of the
project. Here are some examples of tools commonly used for data collection:
1. Web scraping tools such as Beautiful Soup and Scrapy for collecting data from
websites.
2. Survey tools such as SurveyMonkey and Google Forms for collecting survey data.
3. Database management systems such as MySQL and PostgreSQL for collecting and
storing structured data.
Data Science Life Cycle (DSLC)
4. Big data platforms such as Hadoop and Apache Spark for processing large volumes
of unstructured data.
5. APIs for accessing data from social media platforms such as Twitter and Facebook.
6. File formats such as CSV, Excel, JSON, and XML for storing and transferring data.
🞂 It’s important to choose the appropriate tools for data collection that best fit the project
requirements and objectives. Additionally, it’s crucial to ensure that the data collection
process is ethical, legal, and secure.
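To illustrate the scraping idea, here is a minimal sketch using only Python's standard-library `html.parser`; the HTML page and its `Price:` format are invented for the example, and a library such as Beautiful Soup (named above) offers a much friendlier API for the same task:

```python
from html.parser import HTMLParser

# A tiny, invented HTML page standing in for a real website.
html_page = "<html><body><h2>Price: 499</h2><h2>Price: 299</h2></body></html>"

class PriceParser(HTMLParser):
    """Collects the numbers from every <h2>Price: N</h2> element."""
    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        self.in_h2 = (tag == "h2")

    def handle_data(self, data):
        if self.in_h2 and data.startswith("Price:"):
            self.prices.append(int(data.split(":")[1]))

p = PriceParser()
p.feed(html_page)
print(p.prices)  # → [499, 299]
```

In a real project the page would be fetched over HTTP first; here the HTML is inlined so the sketch stays self-contained.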
Data Science Life Cycle (DSLC)
🞂 Data preparation:
🞂 Raw data is often messy and requires cleaning and preprocessing to prepare it for
analysis.
🞂 This step involves tasks such as removing duplicates, filling missing values, and
transforming data types.
🞂 For example, in the customer churn project, data preparation may involve removing
invalid records or imputing missing values.
🞂 During the “data preparation” phase of the data science life cycle, the main goal is to
clean and transform the raw data into a usable format for analysis.
🞂 Here are some examples of questions that may be covered during this phase:
Data Science Life Cycle (DSLC)
1. Is the data complete? Are there missing values or outliers that need to be
addressed?
2. Are there any inconsistencies or errors in the data that need to be corrected?
3. Do the variables need to be transformed or scaled in order to fit the model
assumptions?
4. Do any categorical variables need to be converted into numerical values or one-
hot encoded?
5. Are there any duplicated or redundant observations that need to be removed?
6. Do we need to merge multiple data sources together to create a unified dataset?
7. Are there any additional features that need to be created from the existing data?
8. Does the data need to be sampled or aggregated to reduce its size or
complexity?
Data Science Life Cycle (DSLC)
🞂 Tools used for Data Preparation:
🞂 The goal of the data preparation phase is to ensure that the data is in a suitable format
for analysis and modeling.
🞂 This phase is critical because the quality of the data used for analysis directly impacts
the accuracy and reliability of the results.
🞂 There are several tools and software that can be used for the “data preparation”
phase in the data science life cycle.
1. OpenRefine: OpenRefine is a free and open-source tool for cleaning and transforming
messy data. It allows you to explore and transform large datasets quickly and easily.
Data Science Life Cycle (DSLC)
🞂 Tools used for Data Preparation:
2. Python libraries: There are several Python libraries that are commonly used for data
preparation, including Pandas, NumPy, and SciPy. These libraries provide a wide
range of functions for data cleaning, transformation, and manipulation.
3. Excel: Excel is a widely used spreadsheet software that can be used for simple data
preparation tasks such as filtering, sorting, and data cleaning.
4. SQL: SQL is a database management language that can be used to extract and
manipulate data from databases.
🞂 The choice of tool will depend on the specific requirements of the project, the size of
the dataset, and the skill set of the data scientist.
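The two cleaning tasks named earlier, removing duplicates and filling missing values, can be sketched in plain Python; the records are invented, and Pandas does the same in one call each with `drop_duplicates()` and `fillna()`:

```python
# Invented raw records: one exact duplicate and one missing age.
rows = [
    {"id": 1, "age": 25},
    {"id": 1, "age": 25},    # duplicate record
    {"id": 2, "age": None},  # missing value
    {"id": 3, "age": 35},
]

# 1. Drop exact duplicates while preserving order.
seen, deduped = set(), []
for r in rows:
    key = (r["id"], r["age"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# 2. Impute missing ages with the mean of the known ages.
known = [r["age"] for r in deduped if r["age"] is not None]
mean_age = sum(known) / len(known)
for r in deduped:
    if r["age"] is None:
        r["age"] = mean_age

print(deduped)  # three records, missing age filled with 30.0
```

Mean imputation is just one strategy; depending on the data, dropping the record or using the median may be more appropriate.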
Data Science Life Cycle (DSLC)
🞂 Data exploration:
🞂 Once the data is cleaned and preprocessed, the next step is to explore the data to
gain insights and identify patterns.
🞂 This can involve tasks such as data visualization, statistical analysis, and hypothesis
testing.
🞂 For example, in the customer churn project, data exploration may involve creating
visualizations of customer purchasing behavior to identify trends.
🞂 The “data exploration” phase in the data science life cycle involves analyzing and
visualizing the data to gain insights and understanding.
🞂 Some of the questions that may be covered during this phase include:
Data Science Life Cycle (DSLC)
🞂 Data exploration:
🞂 The answers to these questions will help data scientists to understand the structure
and content of the data, identify any issues or challenges, and develop strategies for
further analysis and modeling. The data exploration phase is an important step in the
data science life cycle, as it lays the foundation for subsequent phases such as data
modeling and evaluation.
Data Science Life Cycle (DSLC)
🞂 Tools used for Data Exploration :
🞂 The “data exploration” phase in the data science life cycle involves analyzing and
visualizing the data to gain insights and understanding.
🞂 There are many tools and software that data scientists can use for this phase,
depending on their specific needs and preferences.
🞂 IBM Watson Studio and Google Colab for cloud-based data analysis and
collaboration
🞂 These tools provide various functionalities for data exploration, including data
cleaning, transformation, visualization, and statistical analysis.
🞂 They allow data scientists to interact with the data and explore it in different ways,
enabling them to identify patterns, relationships, and insights that can inform further
analysis and modeling.
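As a small illustration of this phase, here is a first statistical look at an invented column of monthly purchase counts, using Python's standard-library `statistics` module:

```python
import statistics

# Invented column of monthly purchase counts for eight customers.
purchases = [3, 5, 2, 8, 5, 1, 9, 5]

# The kind of quick summary a data scientist computes before modeling.
print("mean  :", statistics.mean(purchases))    # → 4.75
print("median:", statistics.median(purchases))  # → 5.0
print("mode  :", statistics.mode(purchases))    # → 5
print("stdev :", round(statistics.stdev(purchases), 2))
```

Even these four numbers start to tell a story (a typical customer makes about five purchases, with wide spread), which is exactly the insight-building this phase is about.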
Data Science Life Cycle (DSLC)
🞂 Model building:
🞂 Based on the insights gained from data exploration, the next step is to build a
predictive model that can be used to solve the problem.
🞂 This can involve selecting an appropriate algorithm, training the model on the data,
and tuning the model parameters.
🞂 For example, in the customer churn project, model building may involve training a
logistic regression model to predict the likelihood of a customer churning.
🞂 The “model building” phase in the data science life cycle involves selecting and
developing appropriate models to analyze the data and make predictions or decisions.
Data Science Life Cycle (DSLC)
During this phase, data scientists may ask the following types of questions:
1. What type of model is appropriate for the problem we are trying to solve?
🞂 There are several tools and libraries available for model building in data science,
depending on the specific requirements of the project.
1. Python: Python is a popular programming language for data science and offers a
variety of libraries for model building, such as scikit-learn, TensorFlow, and Keras.
2. R: R is another popular programming language for data science and offers a variety of
packages for model building, such as caret, randomForest, and xgboost.
3. KNIME: KNIME is an open-source data science platform that offers a variety of tools
and functions for model building, including data preprocessing, visualization, and
machine learning.
4. SAS: SAS is a proprietary software suite that offers a variety of tools and functions for
model building, including data preprocessing, visualization, and machine learning.
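The churn example above can be sketched end to end with a minimal logistic regression trained by gradient descent; the dataset (months since last purchase vs. churned) is invented, and a real project would use a library such as scikit-learn instead of hand-rolled training:

```python
import math

# Invented churn data: x = months since last purchase, y = churned (1/0).
X = [1.0, 2.0, 3.0, 8.0, 9.0, 10.0]
y = [0, 0, 0, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Train weight w and bias b with full-batch gradient descent on log-loss.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    gw = sum((sigmoid(w * xi + b) - yi) * xi for xi, yi in zip(X, y)) / len(X)
    gb = sum((sigmoid(w * xi + b) - yi) for xi, yi in zip(X, y)) / len(X)
    w -= lr * gw
    b -= lr * gb

def predict_churn(x):
    """Predict churn when the modeled probability reaches 0.5."""
    return sigmoid(w * x + b) >= 0.5

print([predict_churn(x) for x in X])
```

Customers who purchased recently come out as low churn risk, those inactive for months as high risk, mirroring the logistic-regression churn model described above.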
Data Science Life Cycle (DSLC)
🞂 Model evaluation:
🞂 Once a model is built, the next step is to evaluate how well it performs.
🞂 This can involve tasks such as cross-validation, testing the model on new data, and
evaluating metrics such as accuracy, precision, and recall.
🞂 For example, in the customer churn project, model evaluation may involve testing the
logistic regression model on a holdout dataset to determine its accuracy.
🞂 During the “model evaluation” phase of the data science life cycle, data scientists
typically ask questions that help them assess the quality and effectiveness of the
models they have built.
Data Science Life Cycle (DSLC)
🞂 Model evaluation:
5. Are there any improvements or adjustments that can be made to the model?
🞂 The goal of the model evaluation phase is to ensure that the model is robust, accurate,
and effective in solving the problem it was designed to address. This phase helps data
scientists determine whether the model is ready for deployment and use in real-world
applications.
Data Science Life Cycle (DSLC)
🞂 Tools used for Model Evaluation:
🞂 There are many tools that can be used for model evaluation in data science. Some of
the commonly used ones are:
1. Scikit-learn: This is a popular machine learning library in Python that provides a wide
range of algorithms and evaluation metrics for model evaluation.
2. Keras: This is a high-level neural networks library that can run on top of TensorFlow. It
provides evaluation metrics for model evaluation.
Data Science Life Cycle (DSLC)
🞂 Tools used for Model Evaluation:
3. Excel: This is a spreadsheet software that can be used for basic statistical analysis
and model evaluation.
4. Tableau: This is a data visualization tool that can be used to visualize model results
and evaluate model performance.
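The metrics named earlier (accuracy, precision, recall) can be computed by hand from the confusion-matrix counts; the labels and predictions below are invented for illustration:

```python
# Invented true churn labels and model predictions for eight customers.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Confusion-matrix counts: true/false positives and negatives.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

accuracy  = (tp + tn) / len(y_true)   # fraction of correct predictions
precision = tp / (tp + fp)            # of predicted churners, how many churned
recall    = tp / (tp + fn)            # of actual churners, how many were caught
print(accuracy, precision, recall)    # → 0.75 0.75 0.75
```

Libraries such as scikit-learn provide these same metrics ready-made, but computing them once by hand makes their meaning concrete.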
Data Science Life Cycle (DSLC)
🞂 Model deployment:
🞂 The final step in the data science life cycle is to deploy the model into a production
environment where it can be used to solve the problem.
🞂 This can involve integrating the model with other systems, creating a user interface,
and monitoring the model performance over time.
🞂 For example, in the customer churn project, model deployment may involve integrating
the logistic regression model into a customer relationship management (CRM) system
to identify customers at risk of churning.
🞂 During the “model deployment” phase of the data science life cycle, data scientists
typically ask questions that help them ensure that the model is implemented and used
effectively in real-world scenarios.
Data Science Life Cycle (DSLC)
🞂 Model deployment:
1. How will the model be integrated into the existing system or workflow?
4. What are the potential risks or challenges associated with deploying the model?
5. How will the performance of the model be measured and evaluated once it is in use?
Data Science Life Cycle (DSLC)
🞂 The goal of the model deployment phase is to ensure that the model is implemented
smoothly and effectively, and that it continues to deliver value and solve the problem it
was designed to address over time. This phase involves collaboration with various
stakeholders, including IT teams, end-users, and management, to ensure that the
model is integrated and used effectively within the organization.
Data Science Life Cycle (DSLC)
🞂 Tools used for Model Deployment:
🞂 The choice of tool for model deployment in data science depends on the specific
requirements of the project and the infrastructure available. However, some common
tools used for model deployment in data science include:
1. Docker: Docker is an open-source platform that allows developers to package and
deploy applications in containers. It is often used for deploying machine learning
models in a portable and scalable way.
2. Kubernetes: Kubernetes is an open-source platform for automating deployment,
scaling, and management of containerized applications. It is often used for deploying
machine learning models in production environments.
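A common first step of deployment is persisting the trained model so a serving system can load it. This sketch uses Python's `pickle`, with a plain dict of invented parameter values standing in for a real estimator:

```python
import os
import pickle
import tempfile

# A stand-in "trained model": just a dict of learned parameters
# (the values are invented for the sketch).
model = {"weights": [0.4, -1.2], "bias": 0.1}

path = os.path.join(tempfile.gettempdir(), "churn_model.pkl")

# At training time: serialize the model to disk.
with open(path, "wb") as f:
    pickle.dump(model, f)

# Inside the deployed service: load it back for predictions.
with open(path, "rb") as f:
    loaded = pickle.load(f)

print(loaded == model)  # → True
```

In production the same idea typically runs inside a Docker container like those described above, and pickle files should only ever be loaded from trusted sources.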
Data Science Life Cycle (DSLC)
🞂 These are the typical steps involved in the data science life cycle. The exact details of
each step may vary depending on the project and the specific needs of the
organization.
THANK YOU