Introduction to Data Science Course Outline
Learning Outcomes
At the conclusion of the course, students should be able to:
▪ Describe what Data Science is and the skill sets needed to be a data scientist.
▪ Explain in basic terms what Statistical Inference means. Identify probability distributions
commonly used as foundations for statistical modeling. Fit a model to data.
▪ Use Python to carry out basic statistical modeling and analysis.
▪ Explain the significance of exploratory data analysis (EDA) in data science. Apply basic
tools (plots, graphs, summary statistics) to carry out EDA.
▪ Describe the Data Science Process and how its components interact.
▪ Use APIs and other tools to scrape the Web and collect data.
▪ Apply EDA and the Data Science process in a case study.
▪ Apply basic machine learning algorithms (Linear Regression, k-Nearest Neighbors (k-NN),
k-means, Naive Bayes) for predictive modeling. Explain why Linear Regression and k-NN
are poor choices for Filtering Spam. Explain why Naive Bayes is a better alternative.
▪ Identify common approaches used for Feature Generation. Identify basic Feature Selection
algorithms (Filters, Wrappers, Decision Trees, Random Forests) and use them in applications.
▪ Identify and explain fundamental mathematical and algorithmic ingredients that constitute a
Recommendation Engine (dimensionality reduction, singular value decomposition, principal
component analysis). Build their own recommendation system using existing components.
▪ Create effective visualization of given data (to communicate or persuade).
▪ Work effectively (and synergistically) in teams on data science projects.
▪ Reason about ethical and privacy issues in the conduct of data science and apply ethical practices.
Prerequisites
Students are expected to have basic knowledge of algorithms, reasonable programming
experience, and some familiarity with basic linear algebra (e.g., solution of linear systems and
eigenvalue/vector computation) and basic probability and statistics. If you are interested in taking
the course, but are not sure if you have the right background, talk to the instructor. You may still
be allowed to take the course if you are willing to put in the extra effort to fill in any gaps.
✓ Numeric and Scientific Computation: NumPy and SciPy
✓ scikit-learn: Machine Learning in Python
✓ pandas: Python Data Analysis Library
Data Science Ecosystem Installation
Integrated Development Environments (IDE)
✓ Web Integrated Development Environment (WIDE): Jupyter
Get Started with Python for Data Scientists
✓ Reading, Selecting Data, Filtering Data, Filtering Missing Values, Manipulating
Data, Sorting, Grouping Data, Rearranging Data, Ranking Data and Plotting
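The operations listed above can be sketched in a few lines of pandas. This is a minimal illustration only, not course material: the in-memory CSV, its column names (city, year, sales) and the derived sales_k column are hypothetical.

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for a real data file.
csv_text = """city,year,sales
Oslo,2021,120
Oslo,2022,
Lima,2021,80
Lima,2022,95
Pune,2022,150
"""

df = pd.read_csv(io.StringIO(csv_text))                 # reading
sales_only = df[["city", "sales"]]                      # selecting columns
recent = df[df["year"] == 2022]                         # filtering rows
complete = df.dropna(subset=["sales"])                  # filtering missing values
complete = complete.assign(sales_k=complete["sales"] / 1000)  # manipulating
ordered = complete.sort_values("sales", ascending=False)      # sorting
totals = complete.groupby("city")["sales"].sum()        # grouping
wide = complete.pivot(index="city", columns="year", values="sales")  # rearranging
ranks = totals.rank(ascending=False)                    # ranking (1 = largest total)

# Plotting needs Matplotlib installed, so it is left commented out here:
# totals.plot(kind="bar")
```

Each step returns a new object and leaves df untouched, which makes it easy to inspect intermediate results in a Jupyter notebook.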
2. Data Exploration, Cleaning and Data Visualization
▪ Exploratory Data Analysis (EDA)
▪ Data cleaning and preprocessing techniques
▪ Dealing with missing data and outliers
▪ Data Visualization
▪ Tools for data visualization (e.g., Matplotlib, Seaborn, ggplot2)
▪ Creating static and interactive visualizations
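As one possible illustration of dealing with missing data and outliers, the sketch below fills missing values with the median and flags outliers with the common 1.5 × IQR rule. The data values are made up, and both the median fill and the IQR rule are assumptions chosen for simplicity rather than methods prescribed by the course.

```python
import statistics

# Hypothetical measurements; None marks a missing value.
values = [12.0, 14.0, None, 13.0, 15.0, 98.0, 14.0, None, 13.0]

# Missing data: replace each None with the median of the observed values.
observed = [v for v in values if v is not None]
median = statistics.median(observed)
filled = [median if v is None else v for v in values]

# Outliers: flag observed values more than 1.5 * IQR outside the quartiles.
q1, _, q3 = statistics.quantiles(observed, n=4)
iqr = q3 - q1
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in observed if v < low or v > high]
```

The median is used for filling because, unlike the mean, it is barely affected by the extreme value in the sample.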
3. Statistical Concepts in Data Science
3.1 Descriptive statistics
▪ Introduction
▪ Descriptive statistics
▪ Exploratory Data Analysis
▪ Estimation
✓ Sample and Estimated Mean, Variance and Standard
3.2 Inferential statistics and hypothesis testing
▪ Introduction
▪ Statistical Inference
▪ Measuring the Variability in Estimates
✓ Point Estimates
✓ Confidence Intervals
▪ Hypothesis Testing
✓ Testing Hypotheses Using Confidence Intervals
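The link between point estimates, confidence intervals and hypothesis testing can be sketched as follows. The sample values are hypothetical, and the interval uses the normal-approximation critical value 1.96 as a simplifying assumption; for a sample this small, a t-based critical value would give a somewhat wider interval.

```python
import math
import statistics

# Hypothetical sample; the values are made up for illustration.
sample = [4.8, 5.1, 5.4, 4.9, 5.6, 5.2, 5.0, 5.3, 5.5, 5.2]

n = len(sample)
mean = statistics.mean(sample)      # point estimate of the population mean
sd = statistics.stdev(sample)       # sample standard deviation (n - 1 denominator)
se = sd / math.sqrt(n)              # standard error of the mean

z = 1.96                            # assumed normal-approximation value for 95%
ci_low, ci_high = mean - z * se, mean + z * se

def reject_null(mu0):
    """Reject H0: mu = mu0 at roughly the 5% level when mu0 lies outside the interval."""
    return not (ci_low <= mu0 <= ci_high)
```

Under these assumptions, a hypothesized mean that falls inside the interval is not rejected, while one that falls outside it is.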
4. Machine learning
▪ Introduction
▪ Supervised learning (e.g., decision trees, random forests, support vector machines)
▪ Unsupervised learning (e.g., clustering, dimensionality reduction)
▪ Evaluation of machine learning models
▪ Three Basic Machine Learning Algorithms
✓ Linear Regression
✓ k-Nearest Neighbors (k-NN)
✓ k-means
▪ Machine Learning Algorithms and Their Usage in Applications
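Of the three basic algorithms listed above, k-NN is the shortest to write from scratch. The following is a minimal sketch, assuming Euclidean distance and a simple majority vote; the function name knn_predict and the toy data are hypothetical, and in practice a library such as scikit-learn would normally be used instead.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups in the plane.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]
```

Calling knn_predict(points, labels, (1.1, 1.0)) assigns the query to the group it sits in. Note that every prediction scans the whole training set, which is one reason k-NN scales poorly to large data.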
5. Regression Analysis
▪ Introduction
▪ Linear regression
✓ Simple linear regression
✓ Multiple & Polynomial regression
▪ Sparse models
▪ Logistic regression
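For simple linear regression, the least-squares slope and intercept have a closed form, sketched below in plain Python. The function name fit_line and the data are hypothetical; the formulas are the standard ones (slope = covariance of x and y divided by variance of x).

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_line(xs, ys)
```

The function assumes the x values are not all identical, since sxx would otherwise be zero.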
6. Unsupervised learning
▪ Introduction
▪ Clustering
✓ Similarity and distances
✓ Quality measures of clustering
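The two clustering ingredients named above can be illustrated with a short sketch: two common ways to compare points (Euclidean distance and cosine similarity) and one simple quality measure (the within-cluster sum of squares). The function names are hypothetical, and these are only examples of the many measures in use.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.dist(a, b)

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (1 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def within_cluster_ss(clusters):
    """Sum of squared distances from each point to its cluster centroid."""
    total = 0.0
    for pts in clusters:
        # The centroid is the coordinate-wise mean of the cluster's points.
        centroid = [sum(coord) / len(pts) for coord in zip(*pts)]
        total += sum(math.dist(p, centroid) ** 2 for p in pts)
    return total
```

A lower within-cluster sum of squares indicates tighter clusters, although it always decreases as the number of clusters grows, so it is usually compared across clusterings with the same number of clusters.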
7. Mining Social-Network Graphs
▪ Social networks as graphs
▪ Clustering of graphs
▪ Direct discovery of communities in graphs
▪ Partitioning of graphs
▪ Neighborhood properties in graphs
8. Recommendation Systems: Building a User-Facing Data Product
▪ Algorithmic ingredients of a Recommendation Engine
▪ Dimensionality Reduction
▪ Singular Value Decomposition
▪ Principal Component Analysis
▪ Exercise: build your own recommendation system
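As a starting point for the exercise, the sketch below uses a truncated singular value decomposition to score unrated items. It is a deliberately simplified illustration: the rating matrix is hypothetical, zeros are treated as actual ratings of 0 rather than as missing entries, and the function names low_rank_scores and recommend are made up for this example.

```python
import numpy as np

# Hypothetical user-by-item rating matrix; 0 marks an unrated item.
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def low_rank_scores(matrix, rank=2):
    """Reconstruct the matrix from its top `rank` singular values and vectors."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]

def recommend(matrix, user, rank=2):
    """Return the index of the unrated item with the highest reconstructed score."""
    scores = low_rank_scores(matrix, rank)[user]
    unrated = np.flatnonzero(matrix[user] == 0)
    return int(unrated[np.argmax(scores[unrated])])
```

The function recommend assumes the chosen user has at least one unrated item. Keeping only a few singular values is the dimensionality-reduction step: it smooths the matrix so that users with similar tastes receive similar predicted scores.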
9. Data Science and Ethical Issues
▪ Discussions on privacy, security, ethics
▪ A look back at Data Science
▪ Next-generation data scientists
Books
1. "Python for Data Analysis" by Wes McKinney; "Data Science for Business" by Foster
Provost and Tom Fawcett
2. Introduction to Data Science: A Python Approach to Concepts, Techniques and
Applications, Igual, L.; Seguí, S., Springer, ISBN: 978-3-319-50016-4
3. Data Analysis with Python: A Modern Approach, David Taieb, Packt Publishing, ISBN:
9781789950069
4. Python Data Analysis, Second Ed., Armando Fandango, Packt Publishing, ISBN:
9781787127487
Software and Tools:
• Python (Jupyter Notebooks)
• R (optional)
• Data visualization tools (e.g., Matplotlib, Seaborn, ggplot2)
• Machine learning libraries (e.g., scikit-learn, TensorFlow, PyTorch)
Additional references and books related to the course:
• Jure Leskovec, Anand Rajaraman and Jeffrey Ullman. Mining of Massive Datasets. v2.1,
Cambridge University Press. 2014. (free online)
• Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. ISBN 0262018020. 2013.
• Foster Provost and Tom Fawcett. Data Science for Business: What You Need to Know about
Data Mining and Data-analytic Thinking. ISBN 1449361323. 2013.
• Trevor Hastie, Robert Tibshirani and Jerome Friedman. Elements of Statistical Learning,
Second Edition. ISBN 0387952845. 2009. (free online)
• Avrim Blum, John Hopcroft and Ravindran Kannan. Foundations of Data Science. (Note:
this is a book currently being written by the three authors. The authors have made the first
draft of their notes for the book available online. The material is intended for a modern
theoretical course in computer science.)
• Mohammed J. Zaki and Wagner Meira Jr. Data Mining and Analysis: Fundamental Concepts
and Algorithms. Cambridge University Press. 2014.
• Jiawei Han, Micheline Kamber and Jian Pei. Data Mining: Concepts and Techniques, Third
Edition. ISBN 0123814790. 2011.