Lab 1: Preprocessing Using Python

This document provides an overview of various data preprocessing, analysis, and visualization techniques. It discusses using Python for preprocessing CSV files, different data cleaning and transformation methods, OLAP cube tools for aggregating and analyzing multi-dimensional data, the Apriori algorithm for association rule mining, decision trees for classification, k-means clustering, and Tableau for data visualization. Various concepts are defined, such as data types, OLAP operations, schemas for multi-dimensional data, and joins in Tableau. Example code snippets are also provided for common Python preprocessing tasks.

Uploaded by

PF 21 Disha Gidwani

LAB 1: PREPROCESSING USING PYTHON

DATA PREPROCESSING: Data preprocessing refers to manipulating or dropping data before it is
used, in order to ensure or enhance performance. It is an important step in the data mining
process.
DATA: CSV FILE, PROCESSING: PYTHON
CSV: COMMA-SEPARATED VALUES FILE
1. Import the libraries
2. Import the dataset
3. Upload the dataset
4. Perform tasks or computations
dataset.head(10) = returns the first 10 records
dataset.tail(15) = returns the last 15 records
dataset.info() = column names, non-null counts, data types, and memory usage
dataset.describe() = summary statistics for numeric columns (count, mean, std, min, quartiles, max)
dataset.shape = number of rows and columns, as a tuple (rows, columns)
dataset.size = total number of elements in the data frame (rows * columns)
dataset.ndim = number of dimensions (2 for a data frame)
dataset.isnull() = Boolean mask marking the missing values
dataset.isnull().sum() = number of missing values per column
dataset.isnull().values.any() = True or False, depending on whether any value is missing
dataset.fillna(value) = fills all missing values with the specified value (disadvantage: text
columns also get filled with the number, which makes the data inconsistent)
dataset.dropna() = returns the data with rows that contain missing values removed
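Dropping duplicate values: the shape changes (rows are removed) when duplicates exist.
Dropping missing values: the shape changes only if missing values are still present; if they
were already filled, there is no change in shape.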
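
The calls listed above can be tried on a small in-memory CSV. This is a minimal sketch, not the lab's actual dataset: the column names, the values, and the variable names filled, no_missing and no_duplicates are assumptions made for illustration. It also shows one way around the fillna() disadvantage noted above, by filling each column with a value of its own type.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for an uploaded file.
csv_text = """name,age,city
Asha,23,Pune
Ravi,,Mumbai
Asha,23,Pune
Meera,31,
"""

dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.head(2))          # first 2 records
print(dataset.shape)            # (rows, columns)
print(dataset.isnull().sum())   # missing values per column

has_missing = bool(dataset.isnull().values.any())

# Fill per column so that the text column is not filled with a number.
filled = dataset.fillna({"age": dataset["age"].mean(), "city": "Unknown"})

# dropna() and drop_duplicates() return new data frames; dataset itself is unchanged.
no_missing = dataset.dropna()
no_duplicates = dataset.drop_duplicates()

print(no_missing.shape, no_duplicates.shape)
```

Because dropna() and drop_duplicates() return new data frames, dataset.shape only changes if the result is assigned back to dataset.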

DATA CLEANING: Data cleansing or data cleaning is the process of detecting and correcting
corrupt or inaccurate records from a record set, table, or database and refers to identifying
incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or
deleting the dirty or coarse data.
DATA REDUNDANCY: is a condition created within a database or data storage technology in
which the same piece of data is held in two separate places. Data redundancy can occur by
accident but is also done deliberately for backup and recovery purposes.
DATA INTEGRATION: involves combining data residing in different sources and providing
users with a unified view of them. This process becomes significant in a variety of situations,
which include both commercial and scientific domains.
DATA TRANSFORMATION: is the process of converting data from one format or structure into
another format or structure. It is a fundamental aspect of most data integration and data
management tasks such as data wrangling, data warehousing, data integration and application
integration.
Different types of data sources in data mining:
● Flat Files
● Relational Databases
● Data Warehouses
● Transactional Databases
● Multimedia Databases
● Spatial Databases
NOISY DATA: is meaningless data. The term has often been used as a synonym for corrupt
data. However, its meaning has expanded to include any data that cannot be understood and
interpreted correctly by machines, such as unstructured text.

LAB 2: OLAP CUBE TOOL

OLAP: Online Analytical Processing (OLAP) is a category of software that allows users to
analyze information from multiple database systems at the same time. OLAP tools group,
aggregate, and join data.

SCHEMAS:
STAR SCHEMA (frequently used): It is called a star schema because its physical model resembles
a star, with a fact table at its center and the dimension tables at its periphery
representing the star's points.
SNOWFLAKE SCHEMA: The centralized fact table is connected to multiple dimensions. In
the snowflake schema, dimensions are present in a normalized form in multiple related tables.
FACT CONSTELLATION SCHEMA: It is a collection of multiple fact tables that share some
common dimension tables. It can be viewed as a collection of several star schemas and hence
is also known as a galaxy schema.
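DIFFERENT OLAP OPERATIONS:
DRILL UP (ROLL UP): Summarize data by climbing up a concept hierarchy or by dimension reduction
DRILL DOWN (ROLL DOWN): Move from summarized data to more detailed data by stepping down a
concept hierarchy or by adding a dimension
SLICE AND DICE: Slice selects a single value on one dimension; dice selects values on two or
more dimensions, producing a sub-cube
PIVOT: Re-orient the cube about its axes to view the data from another perspective, for example
presenting a 3D cube as a series of 2D planes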
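
The lab uses an OLAP cube tool, but the same operations can be imitated on a small table with pandas to see what each one does. This is only a sketch under assumptions: the sales table, its columns (year, quarter, region, product, amount) and all variable names are hypothetical, and pandas stands in for the cube tool.

```python
import pandas as pd

# Hypothetical fact table: one row per (year, quarter, region, product) sale.
sales = pd.DataFrame({
    "year":    [2023, 2023, 2023, 2023, 2024, 2024],
    "quarter": ["Q1", "Q1", "Q2", "Q2", "Q1", "Q1"],
    "region":  ["East", "West", "East", "West", "East", "West"],
    "product": ["Pen", "Pen", "Book", "Pen", "Book", "Book"],
    "amount":  [100, 150, 200, 50, 300, 250],
})

# Roll up: climb the time hierarchy from quarter to year.
roll_up = sales.groupby("year")["amount"].sum()

# Drill down: step back down to (year, quarter).
drill_down = sales.groupby(["year", "quarter"])["amount"].sum()

# Slice: fix one dimension to a single value.
slice_east = sales[sales["region"] == "East"]

# Dice: select values on two or more dimensions.
dice = sales[(sales["year"] == 2023) & (sales["product"] == "Pen")]

# Pivot: re-orient to a 2D view with regions as rows and years as columns.
pivot = sales.pivot_table(index="region", columns="year",
                          values="amount", aggfunc="sum")

print(roll_up, drill_down, pivot, sep="\n\n")
```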

Dimension table: identified by a primary key

Fact table: holds foreign keys to the dimension tables and the statistical (numeric) measures

LAB 3: APRIORI ALGORITHM

APRIORI ALGORITHM: The Apriori algorithm is used to compute association rules between
objects, that is, how two or more objects are related to one another. In other words, Apriori is
an association rule learning algorithm that finds patterns such as "people who bought product A
also bought product B".
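SUPPORT: how frequently an item or itemset occurs, i.e. the fraction of transactions that contain it
CONFIDENCE: a conditional probability: of the transactions that contain A, the fraction that
also contain B, i.e. support(A and B) / support(A)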
ASSOCIATION RULE MINING: Association rule mining finds interesting associations and
relationships among large sets of data items. An association rule shows how frequently an
itemset occurs in a set of transactions. A typical application is market basket analysis.
Algorithms include AIS, SETM, Apriori, and variations of the latter.
All subsets of a frequent itemset must be frequent (Apriori property).
If an itemset is infrequent, all its supersets will be infrequent.
Step 1. Computing the support for each individual item
Step 2. Deciding on the support threshold
Step 3. Selecting the frequent items
Step 4. Finding the support of the frequent itemsets
Step 5. Repeat for larger sets
Step 6. Generate Association Rules and compute confidence
Step 7. Compute lift
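
The seven steps above can be sketched in plain Python. This is an illustrative sketch rather than a reference implementation: the transactions, the 0.6 support threshold, and the function names support and apriori are assumptions. Lift is computed as confidence(A -> B) / support(B), and rules are generated from frequent pairs only to keep the example short.

```python
from itertools import combinations

# Hypothetical transactions (market baskets) for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]
min_support = 0.6  # assumed support threshold (step 2)


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def apriori():
    """Return {frozenset: support} for all frequent itemsets, level by level."""
    items = sorted({item for t in transactions for item in t})
    # Steps 1 and 3: support of single items, keep the frequent ones.
    level = [frozenset([i]) for i in items if support({i}) >= min_support]
    frequent = {}
    k = 1
    while level:
        frequent.update({s: support(s) for s in level})
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step (Apriori property): every (k-1)-subset must be frequent.
        level = [
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
            and support(c) >= min_support
        ]
    return frequent


frequent = apriori()

# Steps 6 and 7: rules A -> B from frequent pairs, with confidence and lift.
rules = []
for itemset, sup in frequent.items():
    if len(itemset) == 2:
        for a in itemset:
            (b,) = itemset - {a}
            confidence = sup / frequent[frozenset([a])]
            lift = confidence / frequent[frozenset([b])]
            rules.append((a, b, confidence, lift))

for a, b, confidence, lift in rules:
    print(f"{a} -> {b}: confidence={confidence:.2f}, lift={lift:.2f}")
```

A lift above 1 suggests that A and B occur together more often than they would if they were independent; a lift below 1 suggests the opposite.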

LAB 4: DECISION TREE

DECISION TREE: is a structure that includes a root node, branches, and leaf nodes. Each
internal node denotes a test on an attribute, each branch denotes the outcome of a test, and
each leaf node holds a class label.
The dataset has predictor variables and a class variable (binary classification).
Attribute selection measures: Information Gain, Entropy, Gini Index
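
The three measures can be computed by hand on a tiny dataset. This is a minimal sketch using the standard definitions (entropy in bits, Gini index as 1 minus the sum of squared class proportions, information gain as the drop in entropy after a split); the function names and the rows below are hypothetical and only for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    parent = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return parent - weighted


# Hypothetical binary-classification data: "play" is the class variable.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

labels = [r["play"] for r in rows]
print(entropy(labels), gini(labels))
print(information_gain(rows, "outlook", "play"))
print(information_gain(rows, "windy", "play"))
```

In this toy data "outlook" separates the classes perfectly while "windy" tells nothing about them, so a tree built with information gain would test "outlook" at the root.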

https://www.geeksforgeeks.org/decision-tree-introduction-example/

LAB 5: K-MEANS

K-means is used to solve clustering problems in machine learning or data science.

It is an iterative algorithm that divides an unlabeled dataset into k different clusters in
such a way that each data point belongs to only one group of points with similar properties.

K-means: hard-coded values or a CSV file

Number of clusters = number of centroids
1. Specify the number of clusters K.
2. Initialize the centroids by first shuffling the dataset and then randomly selecting K
data points for the centroids without replacement.
3. Keep iterating until there is no change to the centroids, i.e. the assignment of data
points to clusters is no longer changing.
● Compute the squared distance between each data point and every centroid.
● Assign each data point to the closest cluster (centroid).
● Compute the centroid of each cluster by taking the average of all the data
points that belong to that cluster.
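
The steps above can be sketched in plain Python with hard-coded values. This is a minimal sketch, not a tuned implementation: the function names, the seed and max_iter parameters, and the sample points are assumptions made for illustration.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, seed=0, max_iter=100):
    """Return (centroids, assignments) for a plain k-means run."""
    rng = random.Random(seed)
    # Step 2: pick K distinct data points as initial centroids (no replacement).
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iter):
        # Assign each point to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # assignments stopped changing, so the centroids are stable
        assignments = new_assignments
        # Recompute each centroid as the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignments


# Hypothetical hard-coded 2D points forming two well-separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignments = k_means(points, k=2)

print(centroids)
print(assignments)
```

Which cluster gets label 0 and which gets label 1 depends on the random initial centroids, so only the grouping of the points is meaningful, not the label numbers.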

LAB 6: TABLEAU

TABLEAU: a data visualization tool. Advantages: quick calculations, interactive dashboards,
handles big data. Disadvantages: manual effort, no automatic refreshing of reports.
DATA TYPES: String, integer, boolean, date values, cluster values
FREQUENCY COUNT: finding how frequently each individual value occurs in a column
PARETO ANALYSIS: A Pareto chart is a type of chart that contains both bars and a line graph,
where individual values are represented in descending order by bars, and the ascending
cumulative total is represented by the line.
HISTOGRAM: Shows how many values fall into each range (bin), and therefore the range in which
most of the values fall
USED FOR: Finance, Banking, Healthcare
SERVICES: Tableau Desktop, Prep, Creator, Reader, Public, Viewer
INNER JOIN: Resultant table contains only the values that have matches in both tables
LEFT JOIN: Resultant table contains all values from the left table and the corresponding
matches from the right table
RIGHT JOIN: Resultant table contains all values from the right table and the corresponding
matches from the left table
UNION: Method for combining tables, not a type of join
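
Tableau builds joins and unions through its interface, but the row-level behaviour of each option can be checked with a few lines of pandas. This is only an analogy under assumptions: the two tables, their columns, and the variable names are hypothetical, and pandas merge/concat stand in for Tableau's join and union.

```python
import pandas as pd

# Hypothetical tables sharing the key column "id".
left = pd.DataFrame({"id": [1, 2, 3], "name": ["Asha", "Ravi", "Meera"]})
right = pd.DataFrame({"id": [2, 3, 4], "city": ["Pune", "Mumbai", "Delhi"]})

# Inner join: only ids present in both tables.
inner = left.merge(right, on="id", how="inner")

# Left join: every row of the left table, with matches from the right.
left_join = left.merge(right, on="id", how="left")

# Right join: every row of the right table, with matches from the left.
right_join = left.merge(right, on="id", how="right")

# Union: stacks the rows of tables with the same columns; it is not a join.
more_people = pd.DataFrame({"id": [5], "name": ["Kiran"]})
union = pd.concat([left, more_people], ignore_index=True)

print(inner, left_join, right_join, union, sep="\n\n")
```

Rows without a match keep their own columns and get missing values in the columns that come from the other table.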
