Lecture 1 Overview
Data Analysis and Data Mining
Dr. 李晓瑜 (Xiaoyu Li)
Email: xiaoyu33521@163.com
http://blog.sciencenet.cn/u/uestc2014xiaoyu
2017-Spring
SunData Group http://www.sundatagroup.org/
School of Information and Software Engineering, UESTC
Copyright © 2017 by Xiaoyu Li.
Contents (3 hours)
1.1 What’s big data?
1.2 Overview of data analysis
1.3 Overview of data mining
1.4 Requirements for different professional applications
Reference
Textbook
Jiawei Han, Micheline Kamber, and Jian Pei. Data Mining: Concepts and Techniques. China Machine Press, 2012.
Reference Books
1) Tamhane, Ajit C., and Dorothy D. Dunlop. Statistics and Data Analysis: From Elementary to Intermediate. Prentice Hall, 1999.
2) Li Hang. Statistical Learning Methods (统计学习方法).
Coursera
1) Machine Learning (Andrew Ng)
2) Data Mining (Stanford)
3) Statistical Thinking and Data Analysis (MIT)
Target
1. Know the characteristics of big data;
2. Understand how to gather data analysis requirements;
3. Know the differences and connections between data analysis and data mining.
Big Data
1.1 What’s big data?
(1) Background
(2) Development
[Figure: sources of big data. Labels: Media/Entertainment, Healthcare, DNA, Gene Sequence, fMRI/DTI, Messenger, Watch, Industry, Sensor, Manufacture, E-commerce, Walmart: 2.5 PB/hour, Stock Data]
*Note: some pictures derived from internet
(3) Data Stream
[Figure: data stream sources. Labels: Internet, Surveillance, Spam Filtering, Network Intrusion, Industry, Mobile, Sensor, Smart Phone]
*Note: some pictures derived from internet
(4) Useful Applications
(5) Big Data in Partners
(6) Characteristics of Big Data
(7) Big Data System Today
Definition of big data (by IBM)
(8) Big Data vs. Cloud Computing
Cloud computing
The term "cloud computing" in its modern sense appeared as early as 1996;
the earliest known mention is in a Compaq internal document.
The popularization of the term can be traced to 2006, when Amazon.com
introduced the Elastic Compute Cloud.
(9) Data Source
(10) Unstructured data
Over 80% of data is unstructured
(11) Data Science Process
(12) Issues of Big Data
Collection, Storage, Pre-processing, Analysis, Mining, Visualization
E.g. Visualization of big data
E.g. Real Data of Visualization
E.g. Visual Framework
Reference 1: from Xiaoru Yuan (Beijing)
Before data analysis and data mining
Prerequisites
Probability Theory, Advanced Mathematics, Statistics, Programming, Databases, Information Theory
Technologies
A/B Testing, Crowdsourcing, Data Fusion and Integration, Genetic Algorithms, Machine Learning, Natural Language Processing, Signal Processing, Simulation, Time Series Analysis, Visualisation
Reference
http://www.ibm.com/big-data/us/en/
http://www.gartner.com/technology/topics/big-data.jsp
http://blog.csdn.net/zouxy09/article/details/8775360
http://open.163.com/movie/2012/2/3/C/M8FH262HJ_M8FTVDQ3C.html
http://ocw.mit.edu/courses/sloan-school-of-management/15-062-data-mining-spring-2003/study-materials/
1.2 Overview of data analysis
Data visualization helps in understanding the results of a data analysis.
(1) Definition of data analysis
Definition
A process of inspecting, cleaning, transforming, and modeling data
with the goal of discovering useful information, suggesting conclusions,
and supporting decision-making.
Has clear goals;
Based on collected information;
Based on numerical distributions.
(2) Terms of data analysis
Data integration
Data visualization
Data dissemination
Data modeling
(3) Data analysis for Values
Business intelligence
Statistical applications: descriptive statistics, exploratory data analysis (EDA), confirmatory data analysis (CDA)
Predictive analytics
Text analytics
(4) The process of data analysis
4.1 Data requirements
4.2 Data collection
4.3 Data processing
4.4 Data cleaning
4.5 Exploratory data analysis
4.6 Modeling and algorithms
4.7 Data product
4.8 Communication
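As a minimal sketch of steps 4.3 to 4.6 above, the Python snippet below processes, cleans, explores, and models a tiny data set. The records in RAW, the function names, and the choice of a least-squares line as the model are illustrative assumptions, not part of the lecture material.

```python
from statistics import mean, median

# Hypothetical raw records "hours_studied,exam_score"; two are deliberately bad.
RAW = ["1,52", "2,54", "3,56", "4,", "5,60", "x,99", "6,62"]


def process(raw):
    """4.3 Data processing: split each text record into fields."""
    return [line.split(",") for line in raw]


def clean(rows):
    """4.4 Data cleaning: drop records with missing or non-numeric fields."""
    cleaned = []
    for row in rows:
        try:
            cleaned.append((float(row[0]), float(row[1])))
        except (ValueError, IndexError):
            continue  # skip the bad record
    return cleaned


def explore(pairs):
    """4.5 Exploratory data analysis: simple summary statistics of the scores."""
    scores = [y for _, y in pairs]
    return {"n": len(scores), "mean": mean(scores), "median": median(scores)}


def fit_line(pairs):
    """4.6 Modeling: ordinary least squares fit of y = a + b * x."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in pairs) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b


if __name__ == "__main__":
    data = clean(process(RAW))
    print(explore(data))
    print(fit_line(data))
```

Each step is a separate function so that the stages of the process stay visible; in practice the steps are iterative, and findings from a later step often send the analyst back to an earlier one.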
E.g. Analytical activities of data users
Journal of data analysis
1.3 Overview of data mining
(1) Definition of data mining
Data mining is defined as follows:
Data mining (the analysis step of the “Knowledge Discovery in
Databases” process, or KDD), an interdisciplinary
subfield of computer science, is the computational
process of discovering patterns in large data sets (“big
data”) involving methods at the intersection of artificial
intelligence, machine learning, statistics, and database
systems. The overall goal of the data mining process is to
extract information from a data set and transform it into
an understandable structure for further use.
Source: Wikipedia
(2) Continuous Innovation and development
Supermarket application;
Computer processing power;
Disk storage;
Statistical software.
E.g. Example
(3) Terms
Data;
Information;
Knowledge;
Data Warehouses.
What can data mining do?
(4) Challenges of Data Mining
– Scalability
– Curse of Dimensionality
– Mixed-type data
– Sensitivity of parameters
– Concept drift / evolving data
– Privacy/Security
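To make the "curse of dimensionality" item above concrete, the sketch below measures how the gap between the nearest and the farthest neighbor shrinks, relative to the nearest distance, as the number of dimensions grows. The uniform random data, the relative-contrast measure, and the function name are assumptions chosen for illustration.

```python
import random
from math import dist


def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest for one random query point.

    Points are drawn uniformly from the unit hypercube [0, 1]^dim.
    A small value means all points look almost equally far away.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    query = [rng.random() for _ in range(dim)]
    distances = [
        dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


if __name__ == "__main__":
    for dim in (2, 10, 100, 500):
        print(dim, round(relative_contrast(dim), 3))
```

When the contrast is small, distance-based methods such as nearest neighbor search and clustering have little signal to work with, which is one reason dimensionality reduction is often applied first.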
(5) How does data mining work?
Four types of relationships are sought:
Classes;
Clusters;
Associations;
Sequential patterns.
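Of the four relationship types above, associations are the easiest to show in a few lines. The sketch below computes support and confidence, the two usual measures for an association rule, on a made-up basket data set; the transactions and function names are illustrative assumptions.

```python
# Hypothetical market-basket transactions, one set of items per purchase.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]


def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


if __name__ == "__main__":
    print(support(TRANSACTIONS, {"diapers", "beer"}))
    print(confidence(TRANSACTIONS, {"diapers"}, {"beer"}))
```

Here {diapers, beer} appears in 3 of 5 baskets, and 3 of the 4 baskets with diapers also contain beer. Algorithms such as Apriori search for all rules whose support and confidence exceed chosen thresholds.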
(6) Five major elements of data mining
Extract, transform, and load transaction data
onto the data warehouse system.
Store and manage the data in a
multidimensional database system.
Provide data access to business analysts and
information technology professionals.
Analyze the data by application software.
Present the data in a useful format, such as a
graph or table.
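The five elements above can be sketched end to end on a toy scale. In this hedged example an in-memory SQLite database stands in for the data warehouse (it is a relational store, not a true multidimensional system), and the `sales` table, its columns, and the raw records are all hypothetical.

```python
import sqlite3

# Hypothetical raw transaction records: day,region,product,quantity,unit_price
RAW = [
    "2017-03-01,North,tea,3,2.50",
    "2017-03-01,South,coffee,2,4.00",
    "2017-03-02,north,coffee,2,4.00",
]


def etl(raw_rows, conn):
    """Elements 1 and 2: extract, transform, load, and store the data."""
    conn.execute(
        "CREATE TABLE sales (day TEXT, region TEXT, product TEXT, "
        "qty INTEGER, revenue REAL)"
    )
    for line in raw_rows:
        day, region, product, qty, price = line.strip().split(",")
        qty = int(qty)
        # Transform: normalise region names and derive revenue.
        conn.execute(
            "INSERT INTO sales VALUES (?, ?, ?, ?, ?)",
            (day, region.lower(), product.lower(), qty, qty * float(price)),
        )
    conn.commit()


def revenue_by_region(conn):
    """Elements 3 and 4: give analysts query access and analyze the data."""
    query = "SELECT region, SUM(revenue) FROM sales GROUP BY region ORDER BY region"
    return conn.execute(query).fetchall()


def format_table(rows):
    """Element 5: present the result as a plain-text table."""
    lines = [f"{'region':<8}{'revenue':>10}"]
    for region, revenue in rows:
        lines.append(f"{region:<8}{revenue:>10.2f}")
    return "\n".join(lines)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    etl(RAW, connection)
    print(format_table(revenue_by_region(connection)))
```

The region names are lower-cased during the transform step so that "North" and "north" are aggregated together, a small example of why transformation precedes loading.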
(7) Different levels of data mining
Artificial neural networks
Genetic algorithms
Decision trees
Nearest neighbor method
Rule induction
Data visualization
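As one illustration from the list above, the nearest neighbor method classifies a new record by looking at the most similar records already labeled. The following is a minimal sketch, assuming Euclidean distance and a majority vote among k neighbors; the training points and the function name `knn_predict` are made up for the example.

```python
from collections import Counter
from math import dist

# Hypothetical labeled training data: ((x, y), label)
TRAIN = [
    ((1, 1), "a"), ((1, 2), "a"), ((2, 1), "a"),
    ((8, 8), "b"), ((8, 9), "b"), ((9, 8), "b"),
]


def knn_predict(train, query, k=3):
    """Predict the label of query by majority vote of its k nearest points."""
    # Sort the training records by Euclidean distance to the query point.
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(knn_predict(TRAIN, (2, 2)))
    print(knn_predict(TRAIN, (7, 7)))
```

Sorting the whole training set is fine for a toy example; for large data sets, index structures or approximate search are commonly used instead.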
Top-10 Algorithms (ICDM’06)
• #1: C4.5 (61 votes)
• #2: K-Means (60 votes)
• #3: SVM (58 votes)
• #4: Apriori (52 votes)
• #5: EM (48 votes)
• #6: PageRank (46 votes)
• #7: AdaBoost (45 votes)
• #7: kNN (45 votes)
• #7: Naive Bayes (45 votes)
• #10: CART (34 votes)
1.4 Requirements for different applications
Data requirements:
The data needed as inputs to the analysis are specified
based on the requirements of those directing the application.
The general type of entity on which the data will be
collected is referred to as an experimental unit.
Specific variables for the system under study (e.g., age
and income) may be specified and obtained.
Data may be numerical or categorical (i.e., a text label for
numbers).
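One way to write such data requirements down is as a typed record. The sketch below is an assumption-laden illustration: the `Respondent` class, its `occupation` field, and the rule "int or float means numerical, anything else categorical" are hypothetical choices, with only age and income taken from the text above.

```python
from dataclasses import dataclass, fields


@dataclass
class Respondent:
    """One experimental unit: a single person in a hypothetical survey."""
    age: int          # numerical variable
    income: float     # numerical variable
    occupation: str   # categorical variable (a text label)


def variable_kinds(record_type):
    """Map each variable name to 'numerical' or 'categorical' by its declared type."""
    kinds = {}
    for field in fields(record_type):
        is_number = field.type in (int, float)
        kinds[field.name] = "numerical" if is_number else "categorical"
    return kinds


if __name__ == "__main__":
    print(variable_kinds(Respondent))
```

Declaring the variables and their kinds before collection makes it easier to check later that the collected data actually meet the requirements.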
Summary of Lecture 1
The definition of big data.
The concepts of data analysis
and data mining.
The data science process.
What can data analysis and data
mining do?
Homework 1
Download 3 to 6 recent papers about data analysis
and data mining.
Read them, summarize them, and keep notes.
Identify 1 to 3 questions and present some ideas about them.
Hand in the summaries of
these papers and the questions.
Homework Requirements
Put the original papers in one archive.
Present the summaries in a Word
document.
Email format: name + ID + title.
Deadline: Mar. 20, 2017
Email: xiaoyu33521@163.com
Cc the class monitor.
Highlight - Course Integrity
• All work submitted is to be your own. Cooperative study and
mutual aid are healthy learning methods and are strongly
encouraged. Just cite sources of anything you have copied,
summarized or discussed directly with another. It is cheating
to copy someone's work or allow someone to copy your work.
• It is cheating to copy material without giving credit.
Plagiarism will result in a course grade of F.
• When you find good ideas by other people, the best policy is
to summarize their work in your own words and cite it
as the source for the principle you state. Citing sources
is not a sign of weakness in your own ideas; it is a sign that
you can do research and build on others' work.
Contact Information
E-mail:
xiaoyu33521@163.com
Phone Number:
18108198701 (Chengdu)
Blog:
blog.sciencenet.cn/u/uestc2014xiaoyu