Python for Data Science
Poverty GP Summer University
Welcome!
July 15-19, 2019
Here are the materials:
http://github.com/worldbank/Python-for-Data-Science/
Participant outcomes
With no prior coding skills assumed, participants should be able to:
• access and combine a diverse set of datasets;
• conduct data exploration and visualization;
• utilize Python libraries for geospatial data and machine learning;
• self-teach next steps.
Programming shares many concepts from everyday life.
Guess the output (1)
Lefse, Norwegian pancakes (makes 20)
Ingredients
• Half cup butter
• Half cup cream
• 2.5 cups flour
• 1 t. salt
• 1 T. sugar
• 4 cups riced potatoes (cold)
Method
1. Mix all ingredients
2. Knead thoroughly
3. Form into 20 balls.
4. For each ball:
   • Spread flour on cloth
   • Roll ball in circle with rolling pin
   • Fry on griddle
   • Flip and fry other side
Credit: Think Python (Allen Downey)
Key elements of programming
• A vocabulary of words, abbreviations and symbols.
• Rules about what can be said and where – their syntax.
• A sequence of operations to be performed in order.
• Repetition of some operations (loops) and logical tests that choose between operations (conditions).
• Sometimes, a reference to procedures defined elsewhere (functions).
Credit: Think Python (Allen Downey)
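As a minimal sketch, the lefse recipe can be written as Python so that each element in the list above shows up in turn. The function names, variable names and the "serve" step are hypothetical, chosen only for illustration; they are not taken from the course materials.

```python
# Vocabulary: names, numbers, text and symbols such as = [ ] ( ) :
ingredients = ["butter", "cream", "flour", "salt", "sugar", "riced potatoes"]
number_of_balls = 20


def fry(ball_number):
    """A procedure defined once and referred to later (a function)."""
    return f"lefse {ball_number} fried on both sides"


def make_lefse(ingredients, number_of_balls):
    # Sequence: the steps run in order, top to bottom.
    steps = [f"mix {len(ingredients)} ingredients", "knead thoroughly"]
    steps.append(f"form into {number_of_balls} balls")

    # Repetition: the same operation is applied to each ball (a loop).
    # Syntax: the indented line below belongs to the loop.
    for ball_number in range(1, number_of_balls + 1):
        steps.append(fry(ball_number))

    # Logical test: a condition decides which branch runs.
    if number_of_balls >= 20:
        steps.append("serve a crowd")
    else:
        steps.append("serve the family")
    return steps


steps = make_lefse(ingredients, number_of_balls)
print(len(steps), "steps; last step:", steps[-1])
```

Changing `number_of_balls` to a smaller value is a quick way to see the condition take the other branch.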
Specialized syntax, loops, functions and logic are all common ways of thinking in other domains (like cooking).
Guess the output (2)
[Annotated code example; labels: loop, list, text, elements of syntax, syntax (indentation)]
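The code shown on the original slide is not preserved in this text. As a hypothetical stand-in, the short snippet below contains each of the labelled pieces: a list, a loop, text (strings) and indentation used as syntax. The variable names and values are assumptions made for illustration.

```python
# A list: an ordered collection written with square brackets.
toppings = ["butter", "sugar", "cinnamon"]

# A loop: the indented line runs once for each element of the list.
for topping in toppings:
    # Text (a string) is written inside quotes; + joins two strings.
    print("Lefse with " + topping)

# This line is not indented, so it runs once, after the loop ends.
print("Done")
```

Try to guess what is printed, and in what order, before running it.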
Data science - two popular representations:
[Figure labels: Computer science; Math & stats; Domain expertise]
Source: IMF / Doug Laney
Why Python for Data Science?
Guido van Rossum (Python’s Benevolent Dictator for Life) and the Zen of Python:
Whitespace instead of symbols
• tabs, indentation and line breaks matter
• code remains uncluttered
Variable types determined automatically
• no need to declare the type of your variables before assigning values
Intuitive grammar
• PEP 8: style guide
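A small sketch of the first two points, using made-up values: types are inferred from the assigned value, and indentation alone marks which lines belong to a block.

```python
# No type declarations: Python infers the type from the assigned value.
# (All values here are hypothetical.)
participants = 30                     # int
average_score = 4.5                   # float
course = "Python for Data Science"    # str

print(type(participants).__name__, type(average_score).__name__, type(course).__name__)

# The same name can later be bound to a value of a different type.
participants = "thirty"
print(type(participants).__name__)

# Whitespace is syntax: indentation, not braces, marks the body of each branch.
if average_score > 4:
    verdict = "well received"
else:
    verdict = "needs work"
print(verdict)
```

Removing the indentation before `verdict = ...` makes Python raise an `IndentationError`, which is what "indentation matters" means in practice.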
Three advantages:
1. Python is popular
• Large user community
• Well-maintained libraries
• Online guidance (Stack Overflow)
2. Easy to learn and share
WHY PEOPLE LIKE IT:
• Code is intuitive and expressive (compare C++)
• Suited to large quantities of data
• Transparent, reproducible research through Jupyter Notebooks
3. Thriving ecosystem of tools
Data science workflow, with example libraries for each stage:
• Get data: BeautifulSoup, MySQL client, API clients (Twitter, ESRI, OSMnx…)
• Clean data: Pandas, GeoPandas, Rasterio
• Modeling and analysis: NumPy, SciPy, statsmodels, scikit-learn
• Evaluate and present: Jupyter Notebook, Matplotlib, Flask
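To make the four workflow stages concrete, here is a minimal sketch that walks through them on a tiny, invented dataset. It deliberately uses only the Python standard library as a stand-in for the libraries listed above, so it runs anywhere; the column names and numbers are hypothetical.

```python
import csv
import io
import statistics

# 1. Get data: a hypothetical CSV held in memory stands in for a download or API call.
raw = io.StringIO(
    "region,income\n"
    "north,120\n"
    "north,\n"
    "south,80\n"
    "south,100\n"
)
rows = list(csv.DictReader(raw))

# 2. Clean data: drop rows with a missing income and convert text to numbers.
clean = [
    {"region": row["region"], "income": float(row["income"])}
    for row in rows
    if row["income"].strip()
]

# 3. Modeling and analysis: compute the mean income per region.
by_region = {}
for row in clean:
    by_region.setdefault(row["region"], []).append(row["income"])
mean_income = {region: statistics.mean(values) for region, values in by_region.items()}

# 4. Evaluate and present: print a small text report.
for region, value in sorted(mean_income.items()):
    print(f"{region:<6} {value:6.1f}")
```

In a real project each stage would typically be handled by one of the listed libraries, for example Pandas for cleaning and Matplotlib for presentation.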
Housekeeping
Course outline
Day 1 Variables, data structures, logic, functions.
Day 2 Manipulating large tabular data (Numpy, Pandas), plotting.
Day 3 Web data (APIs), geospatial, machine learning.
Day 4 Call detail records, natural language processing.
Housekeeping
Start time: Please arrive for a 9am start!
Format: Lectures (click along); Labs (your time to write code and read resources)
Coffee and lunch breaks: Approx. 10.45-11.00am, 12.30-2.00pm, 3.30-3.45pm
Requirements: Bring your laptop with full charge, working wifi, and a Google log-in
Help your neighbors if they’re stuck!
Getting started
GitHub repository: github.com/worldbank/Python-for-Data-Science/
First exercise
Scroll down on GitHub ‘day_1’ page, click the link for ‘0_notebooks_intro’
Starting with Colab
• Ensure you’re logged on to your Google account
• Click ‘connect’
• De-select ‘reset all runtimes’ and click ‘run anyway’