01-02 Data Analysis and Data Mining-Lect1

Lecture 1 Overview

Data Analysis and Data Mining


Dr.李晓瑜 Xiaoyu Li
Email:xiaoyu33521@163.com
http://blog.sciencenet.cn/u/uestc2014xiaoyu
2017-Spring

SunData Group http://www.sundatagroup.org/


School of Information and Software Engineering, UESTC
Copyright © 2017 by Xiaoyu Li.
Contents (3 hours)
• 1.1 What’s big data?
• 1.2 Overview of data analysis
• 1.3 Overview of data mining
• 1.4 Requirements for different professional applications

Reference
• Textbook
Data Mining: Concepts and Techniques, Jiawei Han, Micheline Kamber, and Jian Pei, China Machine Press (2012)
• Reference Books
1) Tamhane, Ajit C., and Dorothy D. Dunlop. Statistics and Data Analysis: From Elementary to Intermediate. Prentice Hall, 1999.
2) Statistical Learning Methods (统计学习方法), Hang Li
• Coursera
1) Machine Learning (Andrew Ng)
2) Data Mining (Stanford)
3) Statistical Thinking and Data Analysis (MIT)

Target

• 1 Know the characteristics of big data;

• 2 Be clear about how to obtain the requirements for a data analysis;

• 3 Know the differences and connections between data analysis and data mining.

Big Data

1.1 What’s big data?

(1) Background

(2) Development
[Figure: big data sources. Labels: Media/Entertainment, Healthcare, DNA, fMRI/DTI, Messenger, Watch, Gene Sequence, Industry, E-commerce, Sensor, Manufacture, Walmart: 2.5 PB/hour, Stock Data]

*Note: some pictures are taken from the Internet.
(3) Data Stream
[Figure: data stream sources. Labels: Internet, Surveillance, Spam Filtering, Network Intrusion, Industry, Mobile, Sensor, Smart Phone]
(4) Useful Applications

(5) Big Data in Partners

(6) Characteristics of Big Data

(7) Big Data System Today

Definition of big data

By IBM

(8) Big Data vs. Cloud Computing

Cloud computing

Cloud computing

• Cloud computing in its modern sense appeared as early as 1996;
• The earliest known mention is in a Compaq internal document;
• The popularization of the term can be traced to 2006, when Amazon.com introduced the Elastic Compute Cloud.

(9) Data Source

(10) Unstructured data
• Over 80% of data is unstructured

(11) Data Science Process

(12) Issues of Big Data

[Figure: issues of big data. Labels: Collection, Storage, Pre-processing, Analysis, Mining, Visualization]

E.g. Visualization of big data

E.g. Real Data of Visualization

E.g. Visual Framework

Reference 1: from Xiaoru Yuan (Beijing)


Before data analysis and data mining

• Prerequisites
Probability Theory / Advanced Mathematics / Statistics / Programming / Database / Information Theory
• Technologies
A/B Testing / Crowdsourcing / Data Fusion and Integration / Genetic Algorithms / Machine Learning / Natural Language Processing / Signal Processing / Simulation / Time Series Analysis / Visualisation

Before data analysis and data mining

• Reference

http://www.ibm.com/big-data/us/en/

http://www.gartner.com/technology/topics/big-data.jsp

http://blog.csdn.net/zouxy09/article/details/8775360

http://open.163.com/movie/2012/2/3/C/M8FH262HJ_M8FTVDQ3C.html

http://ocw.mit.edu/courses/sloan-school-of-management/15-062-data-mining-spring-2003/study-materials/

1.2 Overview of data analysis
Data visualization helps in understanding the results of a data analysis.

(1) Definition of data analysis

• Definition
A process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making.
• Clear goals;
• Based on collection of information;
• Based on some numerical distribution.

(2) Terms of data

• data integration
• data visualization
• data dissemination
• data modeling

(3) Data analysis for Values


• Business intelligence
• Statistical applications
• Descriptive statistics
• Exploratory data analysis (EDA)
• Confirmatory data analysis (CDA)
• Predictive analytics
• Text analytics
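
As a small illustration of the "descriptive statistics" item in the list above, here is a minimal Python sketch that uses only the standard library. The `incomes` values and the `describe` helper are made up for illustration and are not part of the lecture.

```python
import statistics

# Hypothetical sample: monthly incomes (made-up values for illustration only).
incomes = [3200, 4100, 3900, 5200, 4100, 15000]

def describe(values):
    """Return a few common descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }

summary = describe(incomes)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this sample the single large value pulls the mean well above the median, which is the kind of pattern exploratory data analysis would then look into.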

(4) The process of data analysis

• 4.1 Data requirements
• 4.2 Data collection
• 4.3 Data processing
• 4.4 Data cleaning
• 4.5 Exploratory data analysis
• 4.6 Modeling and algorithms
• 4.7 Data product
• 4.8 Communication
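
The steps listed above can be sketched end to end on a toy data set. This is only a rough illustration under stated assumptions: the records, the field names `age` and `income`, and every function name are hypothetical, and a least-squares line stands in for the "modeling" step. The requirements and data-product steps are not shown.

```python
# Hypothetical raw records (collection step); values are made up.
raw_records = [
    {"age": "25", "income": "3000"},
    {"age": "32", "income": "4200"},
    {"age": "", "income": "5100"},      # missing age
    {"age": "41", "income": "5600"},
    {"age": "32", "income": "4200"},    # exact duplicate
    {"age": "50", "income": "abc"},     # unparsable income
]

def process(records):
    """Data processing: convert text fields to numbers, None if impossible."""
    def to_number(text):
        try:
            return float(text)
        except ValueError:
            return None
    return [{k: to_number(v) for k, v in r.items()} for r in records]

def clean(records):
    """Data cleaning: drop rows with missing values and exact duplicates."""
    seen, result = set(), []
    for r in records:
        key = (r["age"], r["income"])
        if None in key or key in seen:
            continue
        seen.add(key)
        result.append(r)
    return result

def explore(records):
    """Exploratory analysis: simple per-variable averages."""
    n = len(records)
    return {k: sum(r[k] for r in records) / n for k in ("age", "income")}

def fit_line(records):
    """Modeling: least-squares line income = slope * age + intercept."""
    means = explore(records)
    sxy = sum((r["age"] - means["age"]) * (r["income"] - means["income"])
              for r in records)
    sxx = sum((r["age"] - means["age"]) ** 2 for r in records)
    slope = sxy / sxx
    return slope, means["income"] - slope * means["age"]

rows = clean(process(raw_records))
slope, intercept = fit_line(rows)
# Communication: report the result in plain language.
print(f"{len(rows)} clean rows; income rises by about {slope:.1f} per year of age")
```

Keeping each step in its own function makes it easy to inspect the intermediate data, which matters because cleaning decisions (such as dropping rows) directly change the fitted model.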
E.g. Analytical activities of data users


Journal of data analysis

1.3 Overview of data mining

(1) Definition of data mining

• Data mining is defined as follows:

Data mining (the analysis step of the “Knowledge Discovery in Databases” process, or KDD), an interdisciplinary subfield of computer science, is the computational process of discovering patterns in large data sets (“big data”) involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
(Source: Wikipedia)
(2) Continuous Innovation and development

• Supermarket application;
• Computer processing power;
• Disk storage;
• Statistical software.

Example

(3) Terms

• Data;
• Information;
• Knowledge;
• Data Warehouses.

What can data mining do?

(4) Challenges of Data Mining

– Scalability

– Curse of Dimensionality

– Mixed-type data

– Sensitivity of parameters

– Concept drift / evolving data

– Privacy/Security
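
To give a feel for the "curse of dimensionality" item above, the following sketch measures how the gap between the nearest and the farthest random point shrinks, relative to the nearest distance, as the number of dimensions grows. It assumes points drawn uniformly from the unit hypercube; the function name `distance_contrast` and all parameters are hypothetical, and `math.dist` needs Python 3.8 or later.

```python
import math
import random

def distance_contrast(dim, n_points=200, seed=0):
    """Relative gap between the farthest and nearest point from a random query.

    Points are drawn uniformly from the unit hypercube [0, 1]^dim.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest

for dim in (2, 10, 100, 1000):
    print(f"dim={dim:5d}  contrast={distance_contrast(dim):.3f}")
```

When the contrast becomes small, "nearest" and "farthest" neighbours are almost equally far away, which is one reason distance-based methods tend to struggle on high-dimensional data.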

(5) How does data mining work?

Four types of relationships are sought:
• Classes;
• Clusters;
• Associations;
• Sequential patterns.
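
As one illustration of the "associations" relationship above, the sketch below computes the support and confidence of a single rule over a handful of shopping baskets. The baskets are made-up data, and the helper names `support` and `confidence` are chosen here for illustration; this is not a full association-mining algorithm such as Apriori.

```python
# Hypothetical market-basket transactions (made-up data).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)

rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

Here three of the five baskets contain both items, and three of the four baskets with diapers also contain beer, so the rule "diapers -> beer" has support 0.60 and confidence 0.75.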

(6) Five major elements of data mining

• Extract, transform, and load transaction data onto the data warehouse system.
• Store and manage the data in a multidimensional database system.
• Provide data access to business analysts and information technology professionals.
• Analyze the data with application software.
• Present the data in a useful format, such as a graph or table.
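
The elements above can be sketched in miniature with Python's built-in `sqlite3` module. This is only a toy stand-in: an in-memory SQLite table plays the role of the data warehouse, and the table name `sales`, its columns, and the sample rows are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical raw transaction records (made-up data): date, store, amount text.
raw_rows = [
    ("2017-03-01", "Chengdu", " 12.50 "),
    ("2017-03-01", "Beijing", "8.00"),
    ("2017-03-02", "Chengdu", "20.25"),
]

def transform(rows):
    """Transform: strip whitespace and convert the amount to a float."""
    return [(day, store, float(amount.strip())) for day, store, amount in rows]

# Load: an in-memory SQLite table stands in for the warehouse here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, store TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", transform(raw_rows))

# Analyze: total sales per store.
totals = conn.execute(
    "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
).fetchall()

# Present: a small text table.
for store, total in totals:
    print(f"{store:<10}{total:>8.2f}")
conn.close()
```

A real warehouse would typically use a dedicated multidimensional or columnar store; SQLite is used here only because it ships with Python and keeps the example self-contained.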
(7) Different levels of data mining

• Artificial neural networks
• Genetic algorithms
• Decision trees
• Nearest neighbor method
• Rule induction
• Data visualization
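
Of the techniques listed above, the nearest neighbor method is the simplest to show in code. The following is a minimal sketch of k-nearest-neighbour classification with Euclidean distance and majority voting; the training points, labels, and the function name `knn_predict` are hypothetical, and `math.dist` needs Python 3.8 or later.

```python
import math
from collections import Counter

# Hypothetical labelled points (made-up data): (feature vector, class label).
training = [
    ((1.0, 1.0), "A"), ((1.5, 2.0), "A"), ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"), ((6.5, 7.0), "B"), ((7.0, 6.5), "B"),
]

def knn_predict(query, data, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    by_distance = sorted(data, key=lambda item: math.dist(query, item[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

print(knn_predict((1.2, 1.4), training))
print(knn_predict((6.4, 6.2), training))
```

Sorting all training points is fine for a toy data set; for large data, an index structure or approximate search would usually be preferred.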

Top-10 Algorithms (ICDM’06)
• #1: C4.5 (61 votes)
• #2: K-Means (60 votes)
• #3: SVM (58 votes)
• #4: Apriori (52 votes)
• #5: EM (48 votes)
• #6: PageRank (46 votes)
• #7: AdaBoost (45 votes)
• #7: kNN (45 votes)
• #7: Naive Bayes (45 votes)
• #10: CART (34 votes)
1.4 Requirement for different applications

• Data requirements:
• The data necessary as inputs to the analysis are specified based upon the requirements of those directing the application.
• The general type of entity upon which the data will be collected is referred to as an experimental unit.
• Specific variables for different systems (e.g., age and income) may be specified and obtained.
• Data may be numerical or categorical (i.e., a text label for numbers).
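
The distinction between numerical and categorical variables can be illustrated with a short sketch. Each dictionary below is one experimental unit; the variables `age`, `income`, and `city`, their values, and the helper names are assumptions for illustration. One-hot encoding is shown as one common way to turn a categorical variable into numbers.

```python
# Hypothetical experimental units (made-up data): one dict per person.
units = [
    {"age": 25, "income": 3000.0, "city": "Chengdu"},
    {"age": 41, "income": 5600.0, "city": "Beijing"},
    {"age": 33, "income": 4200.0, "city": "Chengdu"},
]

def variable_kind(values):
    """Label a variable as numerical or categorical from its Python types."""
    is_number = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "numerical" if is_number else "categorical"

def one_hot(values):
    """Encode a categorical variable as 0/1 indicator columns."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]

kinds = {name: variable_kind([u[name] for u in units]) for name in units[0]}
categories, encoded = one_hot([u["city"] for u in units])
print(kinds)
print(categories, encoded)
```

Inferring the kind from Python types is a simplification: a numeric code such as a postal code is still categorical, so in practice the requirements should state each variable's type explicitly.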

Summary of Lecture 1

• The definition of big data.
• The concepts of data analysis and data mining.
• The data science process.
• What can data analysis and data mining do?
Homework 1
• Download 3-6 recent papers about data analysis and data mining.
• Read them, summarize them, and record your notes.
• Find 1-3 questions and show me some ideas about them.
• Hand in the summaries of these papers and the questions.

Homework Requirement
• Put the original papers in one package.
• Present the summaries in a Word document.
• Send it to me by email, using the format: name + ID + title.
• Deadline: Mar. 20, 2017
• Email: xiaoyu33521@163.com
• Copy to the class monitor.
Highlight - Course Integrity
• All work submitted is to be your own. Cooperative study and
mutual aid are healthy learning methods and are strongly
encouraged. Just cite sources of anything you have copied,
summarized or discussed directly with another. It is cheating
to copy someone's work or allow someone to copy your work.
• It is cheating to copy material without giving credit.
Plagiarism will result in a course grade of F.

• When you find good ideas by other people, the best policy is
to summarize their work in your own words and cite it
as the source for the principle you state. Citing sources
is not a sign of weakness in your own ideas; it is a sign that
you can do research and build on others' work.
Contact Information

E-mail:
xiaoyu33521@163.com
Phone Number:
18108198701 (Chengdu)
Blog:
blog.sciencenet.cn/u/uestc2014xiaoyu
