{

"cells": [
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# COMP2200/COMP6200 Week 2\n",
"\n",
"Topics:\n",
"- Git Review\n",
"- Python Pandas, Series and DataFrames\n",
"- Getting Data\n",
"- Reading data in Python"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## GIT\n",
"\n",
"- Distributed Version Control\n",
"- Why are we introducing it here?\n",
" \n",
" - You are writing code - so you should be using VC\n",
" - Provides an audit trail of your work on a project\n",
" - You will be doing a group project, key to collaboration\n",
" \n",
"There are lots of [guides to Git](http://rogerdudler.github.io/git-guide/)
that will show you the basic commands and [explain how Git
works](https://www.atlassian.com/git/tutorials/what-is-git) and [let you try
commands](https://try.github.io/). "
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"You can learn Git on the command line or using a GUI. Knowing the command
line basics is useful if you are ever using it remotely (on a server for example).
Usually, using a GUI is the best idea for a beginner. One reason is that Git is
quite complicated and it is easy to get yourself into a bit of a mess.\n",
"\n",
"There are many GUI tools:\n",
"- [Github Desktop](https://desktop.github.com/)\n",
"- [GitKraken](https://www.gitkraken.com/)\n",
"- [Sourcetree](https://www.sourcetreeapp.com/)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"# Code hosting Repositories\n",
"\n",
"- [Github](https://github.com/)\n",
"- [Bitbucked](https://bitbucket.org/product)\n",
"- [GitLab](https://about.gitlab.com/)"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Fermi Estimation\n",
"\n",
"* The task we did last week (how much toilet paper consumed in Australia per
year) is an example of an Estimation Problem\n",
"* Fermi Estimation is a technique for making estimates of the _order of
magnitude_ of a result\n",
"* Not precise but tries to estimate to the nearest power of 10\n",
"* A good technique for working out whether a claimed result is reasonable\n",
"* Example: [Case Study: Foodstamp
Fraud](https://callingbullshit.org/case_studies/case_study_foodstamp_fraud.html)\
n",
"* What about Australia's [welfare fraud
problem](https://www.google.com/search?q=welfare+fraud+australia+billions)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"outputs": [
{
"data": {
"text/plain": [
"0.08"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# How big is Australia's welfare fraud problem?\n",
"pop = 25e6\n",
"on_welfare = 0.1 * pop\n",
"amount = 10000\n",
"\n",
"total = amount * on_welfare\n",
"\n",
"fraud = 2e9\n",
"fraud / total\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Finding Data\n",
"\n",
"A look at some places that could be good sources of data for DS projects.
What kind of data formats do they use? \n",
"\n",
"- [Data.Gov.au](https://data.gov.au/) - official publication channel for
Australian Govt. data\n",
"- [Australian Bureau of Statistics](https://abs.gov.au/) Census and other
survey data\n",
"- [kaggle.com](https://www.kaggle.com/datasets) runs Data Science & Machine
Learning competitions\n",
"- [Open Addresses](https://openaddresses.io)\n",
"- [Search for it](https://www.google.com.au/search?
client=safari&rls=en&q=open+data&ie=UTF-8&oe=UTF-8&gfe_rd=cr&ei=WXKKWYbgH-
3c8weDmYG4BA) \n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Data Formats\n",
"\n",
"What formats will you find? \n",
"- Excel/CSV - easy to read as long as the data is a simple table (but what if
it isn't?)\n",
"- XML (eg. KML for geographical data)\n",
"- JSON\n",
"- PDF, Word, etc - often interesting data is locked in inappropriate formats\
n",
"\n",
"Eg. see [Data sets available from Transport for
NSW](https://opendata.transport.nsw.gov.au/search/type/dataset?sort_by=changed) -
allows you to filter by data format"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"\n",
"Issues with data once you find it:\n",
"- Missing values for some fields in some records\n",
"- Values in fields are not consistent - eg. response to \"What language do you
speak at home?\" or \"What town were you born in?\"\n",
"- incomplete records - need to link to other data sources\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"slideshow": {
"slide_type": "slide"
}
},
"source": [
"## Data Cleansing\n",
"\n",
"[What is Data Cleansing?](https://www.talend.com/resources/what-is-data-
cleansing/)\n",
"\n",
"[10 Best Data Cleaning Tools](https://www.unite.ai/10-best-data-cleaning-
tools/)\n",
"\n",
"[OpenRefine](http://openrefine.org/) is a tool for pre-processing data
interactivly. It can read various formats of data and help you generate consistent
tabular data that can feed into your analysis. From their home page, OpenRefine
can:\n",
"\n",
"- Import data in various formats\n",
"- Explore datasets in a matter of seconds\n",
"- Apply basic and advanced cell transformations\n",
"- Deal with cells that contain multiple values\n",
"- Create instantaneous links between datasets\n",
"- Filter and partition your data easily with regular expressions\n",
"- Use named-entity extraction on full-text fields to automatically identify
topics\n",
"- Perform advanced data operations with the General Refine Expression
Language\n",
"\n",
"[An example](https://blog.ouseful.info/2013/05/03/a-wrangling-example-with-
openrefine-making-ready-data/) of using OpenRefine to create a useable dataset.\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"celltoolbar": "Slideshow",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
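
The Fermi Estimation slide in the notebook says the aim is to estimate to the nearest power of 10. A minimal sketch of that idea, reusing the rough figures from the welfare fraud code cell, might look like the following. The helper name `order_of_magnitude` is hypothetical (it is not part of the notebook), and the input numbers are the notebook's own rough assumptions rather than official statistics.

```python
import math


def order_of_magnitude(x):
    """Return the exponent of the power of 10 nearest to x on a log scale.

    Hypothetical helper for illustration; x must be positive.
    """
    return round(math.log10(x))


# Rough assumptions copied from the notebook's code cell (not official figures)
pop = 25e6                  # population
on_welfare = 0.1 * pop      # assumed share of people receiving welfare
amount = 10000              # assumed payment per recipient, in dollars
total = amount * on_welfare

fraud = 2e9                 # fraud figure being sanity-checked, in dollars
ratio = fraud / total

print(f"total spend is about 10^{order_of_magnitude(total)} dollars")
print(f"fraud figure is about 10^{order_of_magnitude(fraud)} dollars")
print(f"fraud / total = {ratio:.2f}")
```

Comparing the two exponents is a quick way to judge whether a claimed figure is plausible before looking at the exact ratio.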

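The data issues listed in the notebook (missing values and inconsistent field values) can be illustrated with a few lines of standard-library Python. This is only a sketch under stated assumptions: the sample records, column names, and the choice to normalise town names with `strip()` and `title()` are made up for illustration and do not come from any real dataset.

```python
import csv
import io

# Made-up sample records for illustration only
raw = """name,town,language
Ana,Sydney,English
Ben,sydney ,
Cho,SYDNEY,Mandarin
"""

# Parse the CSV text into a list of dictionaries, one per record
rows = list(csv.DictReader(io.StringIO(raw)))

# Missing values: records where at least one field is empty
missing = [r for r in rows if any(v.strip() == "" for v in r.values())]

# Inconsistent values: normalise whitespace and capitalisation of the town
for r in rows:
    r["town"] = r["town"].strip().title()

towns = {r["town"] for r in rows}
print(f"{len(missing)} record(s) with missing values")
print(f"distinct towns after normalising: {towns}")
```

Real survey responses usually need more than case and whitespace fixes, which is where an interactive tool such as OpenRefine becomes useful.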