1. Introduction to Data Science
Welcome to this course on Data Science. If you are active on Fb/Insta/X, share your
Data Science journey on these platforms and tag me. I would love to see your
progress.
I am so happy you decided to learn Data Science with me, in my style. Let's step
back and talk a bit about why we need data science and what kind of tasks we will
do as data scientists in an organization.
Simple Definition:
Let me give you a very simple, non-fancy definition of Data Science.
Note: The content you just read will be supplied to you as a downloadable PDF.
Since I am writing a summarized version of the video content, which will serve as
revision notes, I recommend reading them while watching the videos.
Thanks and see you in the next lecture!
The Data Science Lifecycle refers to the structured process used to extract insights
from data. It involves several stages, from gathering raw data to delivering
actionable insights. Here is a breakdown of each step:
1. Problem Definition
Understanding the problem you want to solve.
Key Activities:
2. Data Collection
Gathering the raw data needed to solve the problem.
Key Activities:
3. Data Cleaning
Preparing the data by handling missing values, duplicates, and errors.
Fun Fact: Data scientists spend 80% of their time cleaning data!
Key Activities:
4. Data Exploration
Analyzing the data to find patterns, trends, and relationships.
Key Activities:
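To make steps 2 to 4 concrete, here is a minimal pandas sketch, assuming a hypothetical sales.csv file with price and region columns (the file name and columns are made up for illustration):

import pandas as pd

# Data Collection: load raw data from a CSV file (hypothetical file name)
df = pd.read_csv("sales.csv")

# Data Cleaning: remove duplicates and handle missing values
df = df.drop_duplicates()
df = df.dropna(subset=["price"])               # drop rows missing the target column
df["region"] = df["region"].fillna("unknown")  # fill missing categories with a placeholder

# Data Exploration: inspect structure, summary statistics, and correlations
print(df.head())
print(df.describe())
print(df.corr(numeric_only=True))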
5. Model Building
Creating and training machine learning models.
Key Activities:
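As a simple illustration of model building (not the only approach), here is a minimal scikit-learn sketch that trains a linear regression on a small synthetic dataset; all the data and variable names are made up for the example:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Synthetic dataset purely for illustration: predict price from area
rng = np.random.default_rng(42)
area = rng.uniform(50, 200, size=100)
price = 3000 * area + rng.normal(0, 10000, size=100)

X = area.reshape(-1, 1)  # scikit-learn expects a 2-D feature array
y = price

# Hold out a test set so the model can be evaluated on unseen data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)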
6. Model Evaluation
Measuring model performance and accuracy.
Key Activities:
Key Metrics:
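As a hedged example of computing evaluation metrics with scikit-learn (the numbers below are hypothetical; for classification problems you would use metrics such as accuracy, precision, recall, and F1 instead):

from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical true values and model predictions, just to show the metric calls
y_test = [100, 150, 200, 250]
y_pred = [110, 140, 210, 240]

print("MAE:", mean_absolute_error(y_test, y_pred))  # average absolute error
print("R^2:", r2_score(y_test, y_pred))             # share of variance explained by the model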
7. Deployment
Integrating the model into production systems.
Key Activities:
• Package the model for deployment (usually done using web frameworks like Flask
or FastAPI; see the sketch after this list).
• Automate pipelines for continuous learning (MLOps).
• Monitor performance post-deployment.
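As a minimal sketch of what packaging a model with Flask might look like, assuming a trained model was saved earlier as model.pkl (the file name, route, and request format are all illustrative):

import pickle
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load a previously trained model from disk (hypothetical file name)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [[120.0]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=5000)

A client can then send feature values to the /predict endpoint and receive predictions as JSON; FastAPI follows a very similar pattern.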
8. Communication & Reporting
Sharing results and insights with stakeholders.
Key Activities:
• Create dashboards (see the plotting sketch after this list).
• Present findings clearly and concisely.
• Document the process and results.
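As a small, hedged example of turning results into a visual for a report, here is a matplotlib sketch with made-up monthly revenue figures:

import matplotlib.pyplot as plt

# Hypothetical monthly revenue figures, used only to illustrate a report chart
months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [12000, 15000, 14000, 18000]

plt.figure(figsize=(6, 4))
plt.bar(months, revenue)
plt.title("Monthly Revenue")
plt.xlabel("Month")
plt.ylabel("Revenue (USD)")
plt.tight_layout()
plt.savefig("monthly_revenue.png")  # export the chart for a report or dashboard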
9. Maintenance & Iteration
Monitoring, updating, and improving the model over time.
Key Activities:
Summary
1. Problem Definition
2. Data Collection
3. Data Cleaning
4. Data Exploration
5. Model Building
6. Model Evaluation
7. Deployment
8. Communication & Reporting
9. Maintenance & Iteration
By following this lifecycle, data scientists transform raw data into meaningful
insights that drive better decision-making.
I am enjoying teaching this course so far. When working in data science, the right tools make
your work easier, faster, and more efficient. When I started my data science journey
at IIT Kharagpur, I used to code with PyCharm and a regular Python installation. I
knew about Jupyter but wasn't familiar with its capabilities. From writing code to
visualizing data, there are many options to choose from. Here is a breakdown of
popular data science tools and why Anaconda with Jupyter Notebook is an
excellent choice for beginners and advanced users.
The easiest way to run Python programs is to install VS Code and use pip to
install packages, but in this course we will use Anaconda and Jupyter Notebook.
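For reference, assuming Anaconda is already installed, one way to create a separate environment for this course from the terminal looks like this (the environment name ds-course and the exact package list are just examples):

conda create -n ds-course python=3.11
conda activate ds-course
conda install numpy pandas matplotlib scikit-learn jupyter
jupyter notebook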
1. Jupyter Notebook (with Anaconda Distribution)
An open-source web application that allows you to create and share
documents with live code, equations, visualizations, and text.
You can launch it from your terminal with the command:
jupyter notebook
Don't worry, we will do all of these things step by step in the next section.
2. Google Colab
A free, cloud-based Jupyter Notebook environment provided by Google.
Why Use Google Colab?
• Free GPU/TPU Access: Great for deep learning without requiring expensive
hardware.
• Cloud-Based: No local setup—just log in and start coding.
• Collaboration: Share notebooks via links for easy collaboration.
5. Cursor AI
An AI-powered code editor designed for enhanced productivity with machine
learning assistance.
At a glance:
• VS Code: lightweight with advanced features. Best for large projects and debugging.
• PyCharm: professional IDE with deep features. Best for enterprise-level, complex applications.
• Cursor AI: AI-enhanced code suggestions. Best for AI-assisted coding and productivity.
• Spyder: MATLAB-like interface. Best for academic research and scientific computing.
Summary
Choosing the right tool depends on your project size, complexity, and hardware
needs. For most data science workflows, Anaconda with Jupyter Notebook offers
the best balance of simplicity, flexibility, and power.