CS 696 Intro to Big Data: Tools and Methods
Spring Semester, 2020
Doc 1 Introduction
Jan 23, 2020
Copyright ©, All rights reserved. 2020 SDSU & Roger Whitney, 5500
Campanile Drive, San Diego, CA 92182-7700 USA. OpenContent
(http://www.opencontent.org/openpub/) license defines the copyright on this
document.
Course Issues
http://www.eli.sdsu.edu/courses/index.html
Waitlist
Course Web Site
Wiki
Course Recordings
Prerequisites
This room
Grading
Books
Spark & Related Tools
Data Science
2
Waitlist - How to get into a Class
Add yourself to the course waitlist
Instructors cannot
Add individuals to the class
See who is on the waitlist
Change your priority on the waitlist
3
Waitlist - How it works
Waitlist is a priority queue
When a seat in a class becomes available, the top-priority student is added
You cannot be enrolled in two classes that meet at the same time
If the waitlist system adds you to a class, it will drop you from any classes that meet at the same time
During the first week of classes, students are added as others drop
During the second week of classes, students are added only if the instructor releases the seats
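The priority-queue behavior described on this slide can be sketched in a few lines of Python. This is only an illustration, not how the campus waitlist system is implemented: the student names, the numeric priorities (lower number means higher priority), and the helper fill_open_seats are all hypothetical.

```python
import heapq

# Hypothetical waitlist entries: (priority, name); lower number = higher priority.
waitlist = [(3, "student_c"), (1, "student_a"), (2, "student_b")]
heapq.heapify(waitlist)  # arrange the list as a binary heap (a priority queue)


def fill_open_seats(queue, open_seats):
    """Pop the top-priority student once for each seat that opens up."""
    added = []
    while open_seats > 0 and queue:
        _priority, name = heapq.heappop(queue)
        added.append(name)
        open_seats -= 1
    return added


enrolled = fill_open_seats(waitlist, open_seats=2)
print(enrolled)  # ['student_a', 'student_b']
```

After two seats open, only the lowest-priority entry remains on the queue.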
4
Can you add me to the Course?
Instructors can't select individual students to add to the course
5
Waitlist FAQ
Why not get a bigger room and admit everyone?
No hard first assignment to scare people away
No Grader
Do you really want a 600 level class of 100 people?
This is the largest room of its type on campus
6
Waitlist FAQ
Will you be increasing the size of the class?
No
Why not?
No grader
New courses are a lot of work
Technology courses are a lot of work
7
Waitlist FAQ
Feb 4
Last day for regular students to add/drop classes
Open University students have lower priority than SDSU students
8
Waitlist FAQ
So what are my chances of adding this class?
Look up your position on the waitlist
What are the odds of that many people dropping the class?
I cannot see the waitlist
I have no idea how many people will drop
9
Grading
1 exam
4-6 assignments
Project
10
Course Website Demo
11
What are the Tools & Methods?
Programming language - Python
Programming Notebook
Visualization
scatter, box, violin, qq, line, density plots
errorbar, histogram, beeswarms
Statistics
mean, variance, quantiles, distributions
confidence intervals, correlation, covariance
regression, goodness-of-fit, chi-squared test
Bayes theorem
Machine Learning
k-means, DBSCAN, Decision & Regression trees
Streaming - Kafka
Database - Cassandra
Hadoop, Spark, Pig, Mahout, etc.
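As a small taste of the statistics items in the list above, here is a minimal sketch that computes a mean, a variance, and quartiles with Python's standard statistics module. The sample values are made up for illustration, and the standard library is used only to keep the example self-contained; NumPy and SciPy offer equivalents.

```python
import statistics

# Made-up sample values, chosen only so the arithmetic is easy to check by hand.
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.mean(data)                 # arithmetic mean
variance = statistics.pvariance(data)        # population variance
quartiles = statistics.quantiles(data, n=4)  # three cut points splitting the data into quarters

print(mean)       # 5.0
print(variance)   # 4.0
print(quartiles)  # a list of three values; the exact values depend on the interpolation method
```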
12
What will we be doing
Installing programs
Python, Jupyter, Spark, Kafka, Cassandra
Writing Python, Java, Scala-Spark programs
Reports using Jupyter Notebooks
Analyzing data
Distributing data
Visualizing Data
Using Spark
Using Amazon Cloud
13
What will we be doing
~2 Weeks
Intro, Python
~5 weeks
Statistics, ML, NumPy, SciPy
Visualization
~3 weeks
Spark
~2 weeks
Kafka & Cassandra
14
Notebooks - Documentation, development
Python, Julia, R
Others supported by community - Java, Fortran, Haskell, Ruby, Go, Scala, many more
Other notebook systems
Visualization
Python, Julia, R, Matlab
ML
Python (C), Julia, Matlab, R?
Spark - Large Data Sets
Scala, Java, Python, R, Julia
Kafka - Streaming Data
Java, JVM languages, Python, Julia (except for offsets), others - no R client
Cassandra - Data Storage
Java, Python, R - sort of
15
Prerequisites
You will be installing software
Python
Jupyter
Spark
Kafka
Cassandra
Plotly
Some of these are more complex on Windows than Unix/Mac OS
We will be doing some
Statistics
Math
Machine learning
16
Tasks - Install the Following
Jupyter via Anaconda & Conda with Python 3
http://jupyter.readthedocs.io/en/latest/install.html
Spark 2.4.4, pre-built for Apache Hadoop 2.7
Unix/Linux/Mac OS
http://spark.apache.org/docs/latest/
Windows http://wiki.apache.org/hadoop/Hadoop2OnWindows
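Once the installs are done, a quick sanity check from Python can confirm what the interpreter can see. The sketch below is an assumption-laden helper, not part of the course materials: the function name check_environment and the list of import names are illustrative, and a downloaded Spark distribution will not necessarily show up as an importable pyspark module unless it has been added to the Python path.

```python
import importlib.util
import sys

# Import names assumed for illustration; adjust to match what you installed.
PACKAGES = ["notebook", "numpy", "scipy", "plotly", "pyspark"]


def check_environment(packages=PACKAGES):
    """Report whether Python 3 is running and which packages are importable."""
    report = {"python3": sys.version_info.major == 3}
    for name in packages:
        # find_spec returns None when a top-level module cannot be located.
        report[name] = importlib.util.find_spec(name) is not None
    return report


for item, ok in check_environment().items():
    print(f"{item}: {'ok' if ok else 'missing'}")
```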
17
Books
Python Data Science Handbook: Essential Tools for Working with Data
Jake VanderPlas
O'Reilly Media
December 10, 2016
ISBN 9781491912058
Spark: The Definitive Guide
Matei Zaharia, Bill Chambers
February 2018
ISBN 9781491912218
18
Books
Course books are available for free on-line via SDSU library
Need SDSU Library account to access books off campus
Some people do not like reading books on-line
But if you need to save money it is available
May add chapters of other books as the semester progresses
But only from books available on-line
19
Spark, Amazon
You will run Spark on Amazon’s cloud
You need to create an Amazon AWS account
Sign up for Amazon Educate account - $100 compute time for free
But you may incur some cost on Amazon
20
Data Science & Big Data
Very trendy
When topics become trendy in CS the terms become very vague
Big Data Analytics with Excel
Is Data Scientist A Useless Job Title?
21
Data Science
Data science is an interdisciplinary field about processes and systems to extract
knowledge or insights from data in various forms, either structured or
unstructured, which is a continuation of some of the data analysis fields
such as statistics, data mining, and predictive analytics, similar to Knowledge
Discovery in Databases (KDD)
Wikipedia
22
Data Science
Data Scientist (n.):
Person who is better at statistics than any software engineer and
better at software engineering than any statistician.
— Josh Wills (@josh_wills) May 3, 2012
23
Data Engineer
A software engineer who deals with data plumbing
Traditional database setup, Hadoop, Spark, etc.
Data analyst
A person who digs into data to surface insights,
but lacks the skills to do so at scale
They know how to use
Excel, Tableau and SQL
but can’t build a web app from scratch
24
Data Science
Science of transforming data into useful information by means of
Statistical and
Machine learning techniques
25
Data Science & Big Data
Big Data
Data Science with large datasets
No hard boundary between Big Data and medium data
Requires more data plumbing
26
Inconvenient Truth About Data Science
Data is never clean.
You will spend most of your time cleaning and preparing data.
95% of tasks do not require deep learning.
In 90% of cases generalized linear regression will do the trick.
Big Data is just a tool.
You should embrace the Bayesian approach.
No one cares how you did it.
Academia and business are two different worlds.
Presentation is key - be a master of Power Point.
All models are false, but some are useful.
There is no fully automated Data Science. You need to get your hands dirty.
27
[Figure: bar chart of tools (languages, data platforms, analytics) by share of respondents, axis 0% to 70%. Tools listed: SQL; Excel; Python; R; MySQL; Python: numpy, scipy, scikit-learn; ggplot; Microsoft SQL Server; Tableau; JavaScript; Matplotlib (Python); Java; PostgreSQL; Oracle; D3; Homegrown analysis tools; Hive; Spark; Cloudera; Visual Basic/VBA; MongoDB; Apache Hadoop; SAS; C++; PowerPivot; Scala; SQLite; C; Pig; Amazon RedShift; Weka; Hbase; Amazon Elastic MapReduce (EMR); Perl; SPSS; Teradata]
28
[Figure: salary median and IQR (US dollars), axis 0K to 200K, by tool (languages, data platforms, analytics), covering the same tools from SQL through Teradata]
29
30
Rule of Three
If you can not think of three things that might go wrong with your analysis
there is something wrong with your thinking
31
Data Science Versus Programming Jobs
Intuit Job Listing Worldwide Aug 22 2016
Data - 23
Software Engineer - 168
32
Data Science Programming Languages
Python Scala Java
R Julia C++
Matlab C
Javascript C#
SAS
Perl
Ruby
33
Features of Languages for Data Science
Interactive
Statistical, Machine Learning, Math libraries
Plays well with others
Supports computation
Simple syntax
Fast
34
Python
Widely used
Interactive
Lots of libraries
Plays well with others
Slow
Python 2.x versus Python 3.x
3/2
Threads do not scale
Global Interpreter Lock (GIL)
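The "3/2" item on this slide points at one well-known difference between Python 2.x and Python 3.x. In Python 2.x, dividing two integers with / performs integer division, so 3/2 evaluates to 1. In Python 3.x, / always performs true division and // is the floor-division operator. A minimal sketch, run under Python 3:

```python
# Python 3 semantics: "/" is true division, "//" is floor division.
true_quotient = 3 / 2    # 1.5 (in Python 2.x this expression evaluated to 1)
floor_quotient = 3 // 2  # 1, in both Python 2.x and Python 3.x

print(true_quotient, floor_quotient)  # 1.5 1
```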
35
Julia
New language from MIT
Interactive & Fast
Untyped & Typed
Designed for computation
f(x) = 2x + 4
Int32, Int64, Int128, BigInt
Statistical and Math libraries
Plays well with others
LLVM
Lisp style macros
Multiple dispatch
Designed for parallelism & distributed computation
36
Java, Scala, Hadoop, Spark
Hadoop written in Java
Spark written in Scala
JVM languages (Java, Scala, Clojure, Groovy, JRuby, Jython)
Much more efficient on Hadoop & Spark
First access to new features
Scala
OO & Functional
Type inference
Far less verbose than Java
37