"cells": [
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
"source": [
"# COMP2200/COMP6200 Week 2\n",
"- Git Review\n",
"- Python Pandas, Series and DataFrames\n",
"- Getting Data\n",
"- Reading data in Python"
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
"source": [
"## GIT\n",
"- Distributed Version Control\n",
"- Why are we introducing it here?\n",
" \n",
" - You are writing code - so you should be using VC\n",
" - Provides an audit trail of your work on a project\n",
" - You will be doing a group project, key to collaboration\n",
" \n",
"There are lots of [guides to Git](http://rogerdudler.github.io/git-guide/)
that will show you the basic commands and [explain how Git
works](https://www.atlassian.com/git/tutorials/what-is-git) and [let you try
commands](https://try.github.io/). "
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
"source": [
"You can learn Git on the command line or using a GUI. Knowing the command
line basics is useful if you are ever using it remotely (on a server for example).
Usually, using a GUI is the best idea for a beginner. One reason is that Git is
quite complicated and it is easy to get yourself into a bit of a mess.\n",
"There are many GUI tools:\n",
"- [Github Desktop](https://desktop.github.com/)\n",
"- [GitKraken](https://www.gitkraken.com/)\n",
"- [Sourcetree](https://www.sourcetreeapp.com/)"
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
"source": [
"# Code hosting Repositories\n",
"- [Github](https://github.com/)\n",
"- [Bitbucked](https://bitbucket.org/product)\n",
"- [GitLab](https://about.gitlab.com/)"
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
"source": [
"## Fermi Estimation\n",
"* The task we did last week (how much toilet paper consumed in Australia per
year) is an example of an Estimation Problem\n",
"* Fermi Estimation is a technique for making estimates of the _order of
magnitude_ of a result\n",
"* Not precise but tries to estimate to the nearest power of 10\n",
"* A good technique for working out whether a claimed result is reasonable\n",
"* Example: [Case Study: Foodstamp
"* What about Australia's [welfare fraud
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "slide"
"outputs": [
"data": {
"text/plain": [
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
"source": [
"# How big is Australia's welfare fraud problem?\n",
"pop = 25e6\n",
"on_welfare = 0.1 * pop\n",
"amount = 10000\n",
"total = amount * on_welfare\n",
"fraud = 2e9\n",
"fraud / total\n"
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
"source": [
"## Finding Data\n",
"A look at some places that could be good sources of data for DS projects.
What kind of data formats do they use? \n",
"- [Data.Gov.au](https://data.gov.au/) - official publication channel for
Australian Govt. data\n",
"- [Australian Bureau of Statistics](https://abs.gov.au/) Census and other
survey data\n",
"- [kaggle.com](https://www.kaggle.com/datasets) runs Data Science & Machine
Learning competitions\n",
"- [Open Addresses](https://openaddresses.io)\n",
"- [Search for it](https://www.google.com.au/search?
3c8weDmYG4BA) \n"
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
"source": [
"## Data Formats\n",
"What formats will you find? \n",
"- Excel/CSV - easy to read as long as the data is a simple table (but what if
it isn't?)\n",
"- XML (eg. KML for geographical data)\n",
"- JSON\n",
"- PDF, Word, etc - often interesting data is locked in inappropriate formats\
"Eg. see [Data sets available from Transport for
NSW](https://opendata.transport.nsw.gov.au/search/type/dataset?sort_by=changed) -
allows you to filter by data format"
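As a rough illustration of how the first three formats differ once they are loaded, the sketch below parses the same small table from CSV, JSON and XML using only the Python standard library. The station names and passenger counts are made up for the example, and the data is inlined as strings so nothing needs to be downloaded.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Hypothetical sample data, inlined so the example runs without any files.
csv_text = "station,passengers\nCentral,1200\nTown Hall,950\n"
json_text = (
    '[{"station": "Central", "passengers": 1200},'
    ' {"station": "Town Hall", "passengers": 950}]'
)
xml_text = (
    "<stations>"
    "<station name='Central' passengers='1200'/>"
    "<station name='Town Hall' passengers='950'/>"
    "</stations>"
)

# CSV: DictReader yields one dict per row; every value arrives as a string.
csv_rows = list(csv.DictReader(io.StringIO(csv_text)))

# JSON: the parser keeps numbers as numbers and strings as strings.
json_rows = json.loads(json_text)

# XML: walk the element tree and read the attributes of each <station>.
xml_rows = [dict(el.attrib) for el in ET.fromstring(xml_text).iter("station")]

print(csv_rows[0])
print(json_rows[0])
print(xml_rows[0])
```

Note that the CSV and XML versions return the passenger counts as strings, while JSON returns integers, so the same table may need different type conversions depending on the format it was published in.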
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
"source": [
"Issues with data once you find it:\n",
"- Missing values for some fields in some records\n",
"- Values in fields are not consistent - eg. response to \"What language do you
speak at home?\" or \"What town were you born in?\"\n",
"- incomplete records - need to link to other data sources\n"
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
"source": [
"## Data Cleansing\n",
"[What is Data Cleansing?](https://www.talend.com/resources/what-is-data-
"[10 Best Data Cleaning Tools](https://www.unite.ai/10-best-data-cleaning-
"[OpenRefine](http://openrefine.org/) is a tool for pre-processing data
interactively. It can read various formats of data and help you generate consistent
tabular data that can feed into your analysis. From their home page, OpenRefine
"- Import data in various formats\n",
"- Explore datasets in a matter of seconds\n",
"- Apply basic and advanced cell transformations\n",
"- Deal with cells that contain multiple values\n",
"- Create instantaneous links between datasets\n",
"- Filter and partition your data easily with regular expressions\n",
"- Use named-entity extraction on full-text fields to automatically identify
"- Perform advanced data operations with the General Refine Expression
"[An example](https://blog.ouseful.info/2013/05/03/a-wrangling-example-with-
openrefine-making-ready-data/) of using OpenRefine to create a useable dataset.\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
"nbformat": 4,
"nbformat_minor": 2