Lab 1: Preprocessing Using Python
DATA CLEANING: Data cleansing, or data cleaning, is the process of detecting and correcting
corrupt or inaccurate records in a record set, table, or database. It involves identifying
incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or
deleting the dirty or coarse data.
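As a rough illustration of replacing, modifying, and deleting dirty data, here is a minimal sketch in
plain Python. The records, the field names, the 0 to 120 age range, and the choice of mean imputation
are all assumptions made for this example, not part of the lab.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"id": 1, "name": "Asha", "age": 34},
    {"id": 2, "name": "ravi ", "age": None},   # stray space, missing age
    {"id": 2, "name": "ravi ", "age": None},   # duplicate record
    {"id": 3, "name": "Meera", "age": -5},     # impossible value
]

def is_valid_age(age):
    # Assumed plausibility rule for this example only.
    return age is not None and 0 <= age <= 120

def clean(rows):
    """Drop duplicate ids, tidy names, and impute bad ages with the mean valid age."""
    valid_ages = [r["age"] for r in rows if is_valid_age(r["age"])]
    mean_age = sum(valid_ages) / len(valid_ages)
    seen, cleaned = set(), []
    for r in rows:
        if r["id"] in seen:                      # deleting: skip repeated records
            continue
        seen.add(r["id"])
        age = r["age"] if is_valid_age(r["age"]) else mean_age   # replacing
        name = r["name"].strip().title()                          # modifying
        cleaned.append({"id": r["id"], "name": name, "age": age})
    return cleaned

cleaned = clean(records)
for row in cleaned:
    print(row)
```

In practice a library such as pandas would usually do this work; the plain-Python version is only meant
to make the three actions visible.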
DATA REDUNDANCY: is a condition created within a database or data storage technology in
which the same piece of data is held in two separate places. Data redundancy can occur by
accident but is also done deliberately for backup and recovery purposes.
DATA INTEGRATION: involves combining data residing in different sources and providing
users with a unified view of them. This process becomes significant in a variety of situations,
which include both commercial and scientific domains.
DATA TRANSFORMATION: is the process of converting data from one format or structure into
another format or structure. It is a fundamental aspect of most data integration and data
management tasks such as data wrangling, data warehousing, data integration and application
integration.
Different types of data sources in data mining:
● Flat Files.
● Relational Databases.
● Data Warehouses.
● Transactional Databases.
● Multimedia Databases.
● Spatial Databases.
NOISY DATA: is meaningless data. The term has often been used as a synonym for corrupt
data. However, its meaning has expanded to include any data that cannot be understood and
interpreted correctly by machines, such as unstructured text.
OLAP: Online Analytical Processing (OLAP) is a category of software that allows users to
analyze information from multiple database systems at the same time. Typical operations
group, aggregate, and join data.
SCHEMAS:
STAR SCHEMA (frequently used): It is called a star schema because its physical model resembles
a star shape, with a fact table at its center and the dimension tables at its periphery
representing the star’s points.
SNOWFLAKE SCHEMA: The centralized fact table is connected to multiple dimensions. In
the snowflake schema, dimensions are present in a normalized form in multiple related tables.
FACT CONSTELLATION SCHEMA: It is a collection of multiple fact tables having some
common dimension tables. It can be viewed as a collection of several star schemas and hence,
also known as Galaxy schema.
DIFFERENT OLAP OPERATIONS:
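DRILL UP (ROLL UP): Summarize data by climbing up a concept hierarchy or by dimension reduction
DRILL DOWN (ROLL DOWN): Move down the concept hierarchy to more detailed data
SLICE AND DICE: Slice selects a single value on one dimension; dice selects values on two or
more dimensions, producing a sub-cube
PIVOT: Re-orient the cube about its axes to view it differently, for example turning a 3D
visualization into a series of 2D planes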
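The operations above can be sketched on a tiny in-memory "cube" stored as a list of fact rows. This is
only an illustration in plain Python; the sample rows, the column order, and the helper names
(`roll_up`, `slice_`, `dice`, `pivot`) are assumptions for this example, not an OLAP product's API.

```python
from collections import defaultdict

# Hypothetical fact rows: (quarter, city, product, sales)
QUARTER, CITY, PRODUCT, SALES = 0, 1, 2, 3
facts = [
    ("Q1", "Delhi", "Pen", 10),
    ("Q1", "Mumbai", "Pen", 20),
    ("Q1", "Delhi", "Book", 5),
    ("Q2", "Delhi", "Pen", 15),
    ("Q2", "Mumbai", "Book", 30),
]

def roll_up(rows, keep):
    """Sum sales over every dimension not listed in `keep` (dimension reduction)."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in keep)] += row[SALES]
    return dict(totals)

def slice_(rows, dim, value):
    """Fix one dimension to a single value."""
    return [r for r in rows if r[dim] == value]

def dice(rows, conditions):
    """Keep rows whose value is allowed on every listed dimension (a sub-cube)."""
    return [r for r in rows if all(r[d] in ok for d, ok in conditions.items())]

def pivot(summary):
    """Swap the two axes of a two-dimensional summary."""
    return {(b, a): total for (a, b), total in summary.items()}

by_quarter = roll_up(facts, keep=[QUARTER])                 # roll up to quarter level
by_quarter_city = roll_up(facts, keep=[QUARTER, CITY])      # drill down adds city back
q1_only = slice_(facts, QUARTER, "Q1")
q1_delhi = dice(facts, {QUARTER: {"Q1"}, CITY: {"Delhi"}})
by_city_quarter = pivot(by_quarter_city)
print(by_quarter, by_quarter_city, by_city_quarter, sep="\n")
```

Drill down is shown here simply as aggregating at a finer level (quarter and city instead of quarter
alone), which is the reverse of the roll up.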
APRIORI ALGORITHM: The Apriori algorithm is used to find association rules between objects,
that is, how two or more objects are related to one another. In other words, the Apriori
algorithm is an association rule learning method that analyzes, for example, whether people
who bought product A also bought product B.
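SUPPORT: the frequency of occurrence of an itemset, that is, the fraction of transactions that
contain it
CONFIDENCE: a conditional probability: of the transactions that contain A, the fraction that
also contain B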
ASSOCIATION RULE MINING: Association rule mining finds interesting associations and
relationships among large sets of data items. An association rule shows how frequently an
itemset occurs in a transaction. A typical application is market basket analysis. Algorithms
include AIS, SETM, Apriori, and variations of the latter.
All subsets of a frequent itemset must be frequent (Apriori property).
If an itemset is infrequent, all its supersets will be infrequent.
Step 1. Computing the support for each individual item
Step 2. Deciding on the support threshold
Step 3. Selecting the frequent items
Step 4. Finding the support of the frequent itemsets
Step 5. Repeat for larger sets
Step 6. Generate Association Rules and compute confidence
Step 7. Compute lift
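The seven steps can be sketched in plain Python as follows. This is a minimal, unoptimized
illustration, not a reference implementation: the transactions, the 0.5 support threshold, the 0.7
confidence threshold, and the function names are all assumptions chosen for the example.

```python
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(transactions, min_support):
    """Return {frequent itemset: support}; min_support is the Step 2 threshold."""
    items = {item for t in transactions for item in t}
    candidates = [frozenset([item]) for item in items]
    frequent, k = {}, 1
    while candidates:
        # Steps 1, 3, 4: compute support and keep itemsets above the threshold.
        level = {c: support(c, transactions) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_support}
        frequent.update(level)
        # Step 5: join frequent k-itemsets into (k+1)-itemsets and prune any
        # candidate that has an infrequent k-subset (Apriori property).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent

def association_rules(frequent, min_confidence):
    """Steps 6 and 7: yield (lhs, rhs, confidence, lift) for each strong rule."""
    rules = []
    for itemset, supp in frequent.items():
        for size in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, size)):
                rhs = itemset - lhs
                confidence = supp / frequent[lhs]     # P(rhs | lhs)
                lift = confidence / frequent[rhs]     # confidence / support(rhs)
                if confidence >= min_confidence:
                    rules.append((set(lhs), set(rhs), confidence, lift))
    return rules

frequent = apriori(transactions, min_support=0.5)
rules = association_rules(frequent, min_confidence=0.7)
for lhs, rhs, confidence, lift in rules:
    print(lhs, "->", rhs, round(confidence, 3), round(lift, 3))
```

A lift below 1, as in this toy data, suggests the two sides of a rule occur together slightly less often
than they would if they were independent.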
DECISION TREE: is a structure that includes a root node, branches, and leaf nodes. Each
internal node denotes a test on an attribute, each branch denotes the outcome of a test, and
each leaf node holds a class label.
The dataset has predictor variables and a class variable (for example, binary classification).
Attribute selection measures: Information Gain, Entropy, Gini Index
https://www.geeksforgeeks.org/decision-tree-introduction-example/
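To make entropy, Gini index, and information gain concrete, here is a small sketch in plain Python.
The four-row dataset and the attribute names `outlook` and `windy` are made up for illustration; the
formulas are the standard definitions.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy in bits: -sum(p * log2(p)) over the class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index: 1 - sum(p^2) over the class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on `attribute`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical binary-classification data: predictors and a class label per row.
rows = [
    {"outlook": "sunny", "windy": True},
    {"outlook": "sunny", "windy": False},
    {"outlook": "rainy", "windy": True},
    {"outlook": "rainy", "windy": False},
]
labels = ["no", "no", "yes", "yes"]

gain_outlook = information_gain(rows, labels, "outlook")
gain_windy = information_gain(rows, labels, "windy")
print(entropy(labels), gini(labels), gain_outlook, gain_windy)
```

Here `outlook` separates the classes perfectly while `windy` tells us nothing, so a tree built on this
data would test `outlook` at the root.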
LAB 5: K-MEANS
It is an iterative algorithm that divides the unlabeled dataset into k different clusters in
such a way that each data point belongs to only one group of points with similar properties.
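The iteration can be sketched in plain Python as below. This is a simplified version for illustration:
it assumes squared Euclidean distance and, to keep the result reproducible, uses the first k points as
the initial centroids (real implementations usually pick them randomly or with k-means++). The sample
points and function names are assumptions.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def kmeans(points, k, max_iterations=100):
    """Return (centroids, clusters); each point lands in exactly one cluster."""
    centroids = list(points[:k])          # assumption: naive deterministic start
    clusters = [[] for _ in range(k)]
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = [mean_point(c) if c else centroids[i]
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:    # converged: assignments no longer change
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical unlabeled 2-D points forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```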
LAB 6: TABLEAU
TABLEAU: a data visualization tool. Features: quick calculations, interactive dashboards,
support for big data. Limitations: manual effort, no automatic refreshing of reports
DATA TYPES: String, integer, boolean, date values, cluster values
FREQUENCY COUNT: finding how frequently each individual value occurs in a column
PARETO ANALYSIS: A Pareto chart is a type of chart that contains both bars and a line graph,
where individual values are represented in descending order by bars, and the ascending
cumulative total is represented by the line.
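Tableau builds these views interactively, but the numbers behind a frequency count and a Pareto chart
can be sketched in plain Python. The complaint categories below are made-up sample data, and
`pareto_table` is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical column of categorical values.
complaints = ["late", "late", "damaged", "late", "wrong item",
              "damaged", "late", "late", "damaged", "other"]

def pareto_table(values):
    """Rows of (category, count, cumulative percent), largest count first."""
    counts = Counter(values).most_common()   # frequency count, descending
    total = sum(count for _, count in counts)
    running, table = 0, []
    for category, count in counts:
        running += count                      # ascending cumulative total
        table.append((category, count, round(100 * running / total, 1)))
    return table

table = pareto_table(complaints)
for category, count, cumulative in table:
    print(f"{category:<12}{count:>3}{cumulative:>8}%")
```

The counts correspond to the bars and the cumulative percentages to the line; in this sample the top
two categories account for 80% of all values.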
HISTOGRAM: Shows the ranges of values into which most of the values fall
USED FOR: Finance, Banking, Healthcare
SERVICES: Tableau Desktop, Prep, Creator, Reader, Public, Viewer
INNER JOIN: Resultant table contains only the values that have matches in both tables
LEFT JOIN: Resultant table contains all values from the left table and corresponding matches
from the right table
RIGHT JOIN: Resultant table contains all values from the right table and corresponding matches
from the left table
UNION: Method for combining tables, not a type of join
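The difference between the three joins and a union can be seen on two tiny tables. The sketch below
uses plain Python tuples rather than Tableau; the table contents and function names are assumptions,
and unmatched cells are filled with None in the way a join tool would show nulls.

```python
# Hypothetical tables keyed on customer id.
customers = [(1, "Asha"), (2, "Ravi"), (3, "Meera")]   # left: (id, name)
orders = [(2, "Pen"), (3, "Book"), (4, "Lamp")]        # right: (id, item)

def inner_join(left, right):
    """Only keys present in both tables."""
    return [(k, lv, rv) for k, lv in left for rk, rv in right if k == rk]

def left_join(left, right):
    """Every left row; None where the right table has no match."""
    result = []
    for k, lv in left:
        matches = [rv for rk, rv in right if rk == k] or [None]
        result.extend((k, lv, rv) for rv in matches)
    return result

def right_join(left, right):
    """Every right row; implemented as a mirrored left join."""
    return [(k, lv, rv) for k, rv, lv in left_join(right, left)]

def union(table_a, table_b):
    """Append the rows of two tables that share the same columns."""
    return table_a + table_b

print(inner_join(customers, orders))
print(left_join(customers, orders))
print(right_join(customers, orders))
print(union(customers, [(5, "Kiran")]))
```

A join widens rows by matching keys across tables, while a union only stacks rows, which is why it is
not counted as a type of join.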