Big Data Day II

Big Data Preprocessing

Atakilti Brhanu
Introduction to Data Preprocessing
• Data preprocessing is a crucial step in data analysis and machine learning
that involves transforming raw data into a clean and usable format. It
improves the quality of the data, ensuring that models built upon it are more
accurate and reliable.

• Importance of Data Preprocessing


• Improves Model Performance: Clean, consistent data leads to better and more
accurate predictions.
• Reduces Complexity: Preprocessing helps simplify data and models, making them
easier to interpret.
• Handles Data Issues: Identifies and fixes issues such as missing values, outliers, and
noisy data.
Data Cleaning
• Data Cleaning: Ensuring Data Accuracy and Reliability
• Handling Missing Data: Involves either removing missing data points or imputing them with
meaningful values (e.g., mean, median).
• Outlier Removal: Detecting and removing data points that deviate significantly from other observations,
as these can skew model results.
• Noise Reduction: Smoothing data to remove any inconsistencies or "noise" that might mislead analysis.

Why Data Cleaning Matters


• Data cleaning is a fundamental step in Big Data preprocessing. It involves identifying and correcting errors,
inconsistencies, and inaccuracies within your dataset. By cleaning your data, you ensure that your analysis
and modeling results are accurate and reliable. Clean data is essential for making informed decisions and
drawing meaningful insights.
Data Cleaning…
Key Data Cleaning Techniques
• Missing Value Handling: Dealing with missing data points by using
imputation methods like mean, median, or mode replacement.
• Outlier Detection and Removal: Identifying and removing extreme
values that may skew your analysis results.
• Data Standardization: Transforming data to a common scale to
facilitate comparisons and ensure consistency.
• Data Transformation: Applying mathematical functions to modify data
distributions and improve model performance.
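
The sketch below illustrates these cleaning steps with pandas on a tiny made-up dataset; the columns, values, and thresholds are assumptions chosen for illustration, not a prescribed recipe.

```python
# Minimal data-cleaning sketch with pandas; the "age"/"income" columns and
# values are invented for illustration.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 29, 120],          # 120 looks like an outlier
    "income": [40000, 52000, 48000, np.nan, 45000, 47000],
})

# Missing value handling: impute with the column median
df["age"] = df["age"].fillna(df["age"].median())
df["income"] = df["income"].fillna(df["income"].median())

# Outlier detection and removal: drop rows outside 1.5 * IQR on "age"
q1, q3 = df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
df = df[df["age"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

# Data standardization: rescale income to zero mean and unit variance
df["income_std"] = (df["income"] - df["income"].mean()) / df["income"].std()
print(df)
```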
Data Integration
• Merging Data from Different Sources:
• Data integration combines data from multiple sources into a unified dataset. This process
eliminates inconsistencies, reduces redundancy, and creates a comprehensive view of your
data. Effective data integration allows for more accurate analysis and more valuable
insights.
• Data Transformation and Standardization
• Data transformation and standardization are crucial aspects of data integration.
Transforming data involves converting it into a consistent format. Standardization ensures
that all data is measured using the same units and scales, making comparisons and
analysis more effective.
• Data Validation and Quality Control
• Data validation and quality control are essential after integration. This step ensures that
the combined data meets predetermined quality standards, identifying and correcting
errors, inconsistencies, and redundancies to maintain data integrity and reliability.
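
As a rough illustration of integration, transformation, and validation, the following sketch merges two hypothetical sources (a CRM table and a billing table) on an assumed customer_id key; all names and values are invented.

```python
import pandas as pd

crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Abebe", "Sara", "Lena"],
    "signup_date": ["2023-01-10", "2023-02-05", "2023-03-20"],
})
billing = pd.DataFrame({
    "customer_id": [1, 2, 4],
    "total_spend": [120.0, 340.5, 89.9],
})

# Transformation/standardization: parse dates into a common type before merging
crm["signup_date"] = pd.to_datetime(crm["signup_date"])

# Integration: an outer join keeps customers that appear in either source
unified = crm.merge(billing, on="customer_id", how="outer")

# Validation/quality control: flag records that are missing in one source
print(unified[unified.isna().any(axis=1)])
```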
Data Transformation
• Data Transformation: Shaping Data for Analysis
• Normalization/Standardization: Scaling data so that features have a consistent range or mean/variance, improving the performance of algorithms sensitive to data scales.
• Encoding Categorical Data: Converting categorical variables into numerical values (e.g., using one-hot encoding or label encoding) for algorithms that require numerical input.
• Discretization: Converting continuous attributes into categorical ones, often for the purpose of easier model interpretability or handling specific types of data.
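
A minimal sketch of these three transformations, assuming pandas and scikit-learn and using invented column names:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "income": [40000, 52000, 61000, 45000],
    "city":   ["Addis Ababa", "Mekelle", "Addis Ababa", "Adama"],
    "age":    [23, 35, 47, 61],
})

# Normalization/standardization: zero mean and unit variance for income
df["income_scaled"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# Encoding categorical data: one-hot encode the city column
city_dummies = pd.get_dummies(df["city"], prefix="city")

# Discretization: bin the continuous age attribute into categories
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 120],
                         labels=["young", "middle", "senior"])

print(pd.concat([df, city_dummies], axis=1))
```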
Data Reduction
• Data Reduction: Streamlining Data for Efficiency
• Dimensionality Reduction: Reduces the number of variables in a dataset
while preserving as much information as possible, improving model efficiency
and interpretability. Methods include Principal Component Analysis (PCA).
• Sampling: Selects a representative subset of data from a larger dataset,
reducing computational complexity and enabling faster analysis. Techniques
include random sampling, stratified sampling, and cluster sampling.
• Data Aggregation
• Combines data points into groups or summaries, reducing data volume while
retaining essential information. Examples include calculating averages, sums,
or other aggregate statistics for groups of data points.
• Data Pruning
• Removes irrelevant or redundant data points, simplifying the dataset and
improving model performance. This can involve removing features with low
variance or eliminating outliers.
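
The following hedged sketch shows dimensionality reduction (via PCA), sampling, and aggregation on a synthetic dataset; the sizes and parameters are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
df = pd.DataFrame(rng.normal(size=(1000, 10)),
                  columns=[f"f{i}" for i in range(10)])

# Dimensionality reduction: project the 10 features onto 3 principal components
pca = PCA(n_components=3)
components = pca.fit_transform(df)
print("explained variance ratio:", pca.explained_variance_ratio_)

# Sampling: a 10% random subset as a representative sample
sample = df.sample(frac=0.10, random_state=42)
print("sample size:", len(sample))

# Aggregation: summarize groups of rows with a single statistic
df["group"] = rng.integers(0, 5, size=len(df))
print(df.groupby("group")["f0"].mean())
```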
Data Splitting
• Training and Testing: Dividing the dataset into separate training and
testing sets to evaluate model performance on unseen data.
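
A minimal example of such a split with scikit-learn's train_test_split, using toy arrays in place of a real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(50, 2)   # 50 samples, 2 features
y = np.arange(50) % 2               # toy binary labels

# Hold out 20% of the rows so the model is evaluated on unseen data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(X_train.shape, X_test.shape)   # (40, 2) (10, 2)
```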
Big Data Analysis
Introduction to Data Analytics
• Data analytics is the process of examining data to extract insights and
information that can be used to make informed decisions.

• It involves collecting, cleaning, organizing, and analyzing data to uncover patterns, trends, and relationships. This process includes data storage, retrieval, and the application of tools and techniques to extract meaningful insights.
Data Analytics Vs. Data Analysis
• While the terms data analytics and data analysis are frequently used
interchangeably, data analysis is a subset of data analytics concerned with
examining, cleansing, transforming, and modeling data to derive conclusions.
• Data analytics includes the tools and techniques used to perform data analysis.
Data Analysis
• Core focus: The process of examining data to uncover patterns, trends, and relationships.
• Scope: Typically involves a smaller dataset and simpler techniques.
• Goal: To answer specific questions or solve particular problems.

Data Analytics
• Core focus: A broader field encompassing data analysis, as well as data collection, preparation, storage, and interpretation.
• Scope: Often deals with large datasets and complex analyses.
• Goal: To extract insights and information that can be used to make informed decisions.
Data Analytics Vs. Data Analysis...
• In essence, data analysis is a component of data analytics.

• Data analytics provides a more comprehensive framework for handling and utilizing data, while data analysis focuses specifically on the process of examining the data itself.
Types of Data Analytics
1. Descriptive analytics:
• Descriptive analytics refers to the use of data analysis techniques to
summarize historical data and provide insights into what has happened in
the past.
• Purpose: Helps organizations understand their past performance, identify
trends, and make informed decisions.
• Key techniques:
• Data Aggregation: Combining data from various sources to create a comprehensive
view.
• Data Visualization: Using charts, graphs, and dashboards to present data clearly.
• Statistical Analysis: Applying statistical methods to interpret data trends and
patterns.
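
To make this concrete, here is a small, hypothetical example of descriptive analytics with pandas (aggregation plus summary statistics); the sales figures are invented.

```python
import pandas as pd

sales = pd.DataFrame({
    "month":   ["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"],
    "region":  ["North", "South", "North", "South", "North", "South"],
    "revenue": [1200, 950, 1400, 1100, 1350, 1250],
})

# Data aggregation: total and average revenue per month
print(sales.groupby("month")["revenue"].agg(["sum", "mean"]))

# Statistical analysis: overall distribution of revenue
print(sales["revenue"].describe())
```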
Examples of Descriptive Analytics
Descriptive Analytics Applications in SEO

Example 1: Keyword Performance Analysis


• Data Collection: Gather data on keyword rankings, search volume, and click-through
rates (CTR) using tools like Google Search Console and SEMrush.
• Analysis: Aggregate this data to see which keywords are driving traffic and how their
rankings have changed over time.
• Visualization: Use line charts to track keyword performance over different periods.
• Outcome: Identify top-performing keywords and seasonal trends in search behavior.
2. Diagnostic analytics:
• Diagnostic analytics goes beyond describing the data to explore the reasons
behind observed trends and patterns.
• It helps answer "why" questions by identifying root causes and contributing
factors, often using data visualization and statistical analysis.
• Purpose: Helps organizations identify factors contributing to trends or
changes in performance.
• Key techniques:
• Data Exploration: Analyzing data sets to uncover patterns or relationships.
• Statistical Techniques: Using correlation, regression, and other methods to uncover relationships and potential causes.
• Root Cause Analysis: Investigating underlying reasons for specific outcomes or trends.
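
As a toy illustration of diagnostic analysis, the sketch below correlates invented customer attributes with a churn flag; correlation only points to candidate causes that root cause analysis would then investigate.

```python
import pandas as pd

df = pd.DataFrame({
    "support_tickets": [0, 5, 1, 8, 2, 7],
    "monthly_fee":     [20, 60, 25, 70, 30, 65],
    "churned":         [0, 1, 0, 1, 0, 1],
})

# Correlation suggests factors to investigate; it is not proof of causality
print(df.corr()["churned"].sort_values(ascending=False))
```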
Example: Banking Use Case: Loan Return Analysis
• Identify factors influencing loan defaults and successful returns.
• Analysis Process:
• Data Collection: Loan data, credit scores, customer financial profiles,
repayment schedules.
• Pattern Analysis: Identify factors like income, loan size, interest rates, and
credit history affecting outcomes.
• Segmentation: Group loans based on risk factors: high loan amount, low
income, poor credit score.
• Root Cause Analysis: Determine key causes of default (e.g., economic
downturn, insufficient income).
• Outcome:
• Risk Mitigation: Better loan approval criteria.
• Targeted Interventions: Personalized repayment plans for at-risk customers.
• Improved Loan Performance: Minimized defaults, optimized returns.
3. Predictive analytics:
• Predictive analytics leverages historical and current data to identify
patterns and trends, enabling users to forecast future outcomes.

• By integrating AI, machine learning (ML), and data mining techniques, predictive models can analyze large datasets to make accurate predictions.

• These technologies improve the efficiency and accuracy of predictions, helping businesses forecast market trends, consumer behaviors, financial outcomes, and more.
Types of predictive modeling
• Predictive analytics models are designed to assess historical data,
discover patterns, observe trends, and use that information to predict
future trends.
• Various types of predictive modeling techniques are employed, depending on
the nature of the data. Popular predictive analytics models include:
• Classification
• Clustering
• Regression analysis
• Time series models
Classification models
• Predicts a categorical outcome, assigning data points to specific classes.
Applications include identifying fraudulent transactions, classifying customer
segments, and predicting loan defaults.

• For example, this model can be used to classify customers or prospects into groups for segmentation purposes. Alternatively, it can be used to answer questions with binary outputs, such as yes/no or true/false; popular use cases are fraud detection and credit risk evaluation.

• Types of classification models include logistic regression, decision trees, random forest, neural networks, and Naïve Bayes.
Classification models…
Data Mining Application: Diabetes Prediction Using Random Forest

• Input Data: Age, BMI (Body Mass Index), blood sugar level, family history
of diabetes.
• Classification Model: Random Forest, trained to classify patients as
"Diabetic" or "Non-Diabetic."
• Outcome: Model identifies high-risk patients based on patterns in the
data and helps doctors intervene earlier for disease prevention or
management.
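
A hedged sketch of this workflow with scikit-learn's RandomForestClassifier, trained on a tiny synthetic table standing in for real patient records; the values are illustrative, not clinical data.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = pd.DataFrame({
    "age":            [25, 54, 61, 33, 47, 58, 29, 66],
    "bmi":            [22.1, 31.4, 29.8, 24.0, 33.2, 30.5, 21.7, 35.0],
    "blood_sugar":    [90, 160, 150, 95, 170, 155, 88, 180],
    "family_history": [0, 1, 1, 0, 1, 0, 0, 1],
    "diabetic":       [0, 1, 1, 0, 1, 1, 0, 1],
})

X = data.drop(columns="diabetic")
y = data["diabetic"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# Classify a new (hypothetical) patient
new_patient = pd.DataFrame([[50, 32.0, 165, 1]], columns=X.columns)
print("prediction:", model.predict(new_patient)[0])   # 1 = Diabetic, 0 = Non-Diabetic
```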
Clustering models
• Groups data points based on their similarities, identifying natural
clusters within the data.
• Applications include customer segmentation, anomaly detection, and identifying similar products.
Clustering models…
• Group emails based on content to improve marketing and customer service.
1. Machine Learning (K-Means Clustering):
• Process:
• Data Collection: Gather email metadata and content.
• Preprocessing: Convert email text using TF-IDF.
• Modeling: Apply K-Means to group emails into topics.
• Outcome:
• Automatic categorization: "Promotions," "Customer Queries," "Internal Issues."
2. Data Mining (Hierarchical Clustering):
• Process:
• Data Collection: Gather support emails.
• Preprocessing: Tokenize and convert emails to vectors.
• Modeling: Use Hierarchical Clustering to identify priority topics.
• Outcome:
• Prioritize customer support emails by issue type (e.g., "Urgent").
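
The K-Means branch of this example might look roughly like the following sketch, using scikit-learn's TfidfVectorizer and KMeans on a handful of invented email snippets; the choice of three clusters is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

emails = [
    "50% discount on all items this weekend only",
    "My invoice shows the wrong amount, please fix it",
    "Server outage reported in the data center",
    "Exclusive promotion: buy one get one free",
    "Customer cannot log in to their account",
]

# Preprocessing: convert email text to TF-IDF vectors
X = TfidfVectorizer(stop_words="english").fit_transform(emails)

# Modeling: group the emails into 3 topics
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

for label, email in zip(labels, emails):
    print(label, email)
```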
Regression Analysis
• Regression analysis is a statistical method used to model the
relationship between a dependent variable (the outcome) and one or
more independent variables (predictors).

• It is a foundation of data science and statistics, offering powerful tools for prediction, modeling, and understanding complex relationships.
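
A minimal regression example with scikit-learn, fitting sales (dependent variable) against advertising spend (independent variable) on made-up numbers:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

ad_spend = np.array([[10], [20], [30], [40], [50]])   # predictor
sales = np.array([25, 45, 62, 85, 105])               # outcome

model = LinearRegression().fit(ad_spend, sales)
print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("predicted sales at spend=60:", model.predict([[60]])[0])
```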
Time Series Modeling
• Analyzes data points collected over time to identify patterns and trends.
Applications include forecasting stock prices, predicting product
demand, and analyzing website traffic.

• Example: Time series modeling is widely used in financial markets to forecast stock prices based on historical data. By analyzing patterns and trends over time, ML algorithms can predict future stock movements.
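
One deliberately simple sketch uses a rolling average of synthetic daily demand as a naive one-step forecast; real forecasting would typically use dedicated models such as ARIMA or exponential smoothing.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2024-01-01", periods=30, freq="D")
noise = np.random.default_rng(0).normal(0, 5, 30)
demand = pd.Series(100 + np.arange(30) + noise, index=idx)

# Identify the trend with a 7-day rolling average
trend = demand.rolling(window=7).mean()

# Naive forecast: tomorrow's demand is roughly the latest 7-day average
print("naive forecast for the next day:", round(trend.iloc[-1], 1))
```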
4. Prescriptive analytics:
• Prescriptive analytics goes beyond prediction, offering recommendations and
suggesting actions to optimize outcomes.
• It uses optimization algorithms, simulations, and decision models to suggest the
best course of action based on predictions.

• Methods/Tools:
• Optimization techniques (e.g., linear programming), simulations, and decision analysis.
• Tools such as IBM Decision Optimization, SAS, and Gurobi are commonly used for
prescriptive analytics.
• Example:
• Recommending the best supply chain routes to minimize costs while meeting demand.
• Offering personalized product suggestions to customers based on predicted preferences.
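
As a toy illustration of the optimization side of prescriptive analytics, the sketch below uses scipy.optimize.linprog to choose shipment quantities on two routes; the costs, capacities, and demand figure are invented.

```python
from scipy.optimize import linprog

costs = [4, 6]                 # cost per unit on route A and route B

# Demand constraint x_A + x_B >= 100, rewritten as -x_A - x_B <= -100
A_ub = [[-1, -1]]
b_ub = [-100]

bounds = [(0, 70), (0, 80)]    # per-route capacity limits

result = linprog(c=costs, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print("optimal quantities (route A, route B):", result.x)
print("minimum total cost:", result.fun)
```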
Types of Prescriptive Modeling:
• Various types of prescriptive modeling approaches exist, each suitable
for different decision-making scenarios.
• Optimization Models
• Simulation Models
• Decision Analysis Models
• Machine Learning-Based Prescriptive Models (Reinforcement Learning, Supervised Learning, and Unsupervised Learning)
Summary of the Four Types of Analytics

Descriptive Analytics
• Goal: Understand past events
• Question answered: "What happened?"
• Methods/Tools: Data aggregation, visualization (e.g., dashboards)
• Example: Monthly sales reports, web traffic analysis

Diagnostic Analytics
• Goal: Identify causes of past events
• Question answered: "Why did it happen?"
• Methods/Tools: Root cause analysis, correlation analysis
• Example: Analyzing why customer churn occurred

Predictive Analytics
• Goal: Predict future outcomes
• Question answered: "What is likely to happen?"
• Methods/Tools: Machine learning, data mining, regression, time series analysis
• Example: Predicting future sales, customer churn prediction

Prescriptive Analytics
• Goal: Recommend actions
• Question answered: "What should we do?"
• Methods/Tools: Optimization algorithms, decision models, simulations
• Example: Supply chain optimization, personalized recommendations
How Big Data Analytics works
• Big data analytics refers to collecting, processing, cleaning, and
analyzing large datasets to help organizations operationalize their big
data.
i. Data Collection and Storage
• Data Sources: Big data originates from various sources, including databases,
social media platforms, sensors, and web logs.
• Data Warehouses: Data warehouses are centralized repositories for storing
large volumes of data from multiple sources, often used for analytical
purposes.
• Data Lakes: Data lakes are vast repositories that store raw data in its native
format, allowing for flexibility and scalability.
• Cloud Storage: Cloud storage services provide scalable and cost-effective
solutions for storing and managing large datasets.
How Big Data Analytics works…
ii. Data Preprocessing and Transformation
• Data Cleaning: This involves identifying and correcting errors,
inconsistencies, and missing values in the data. It ensures data accuracy
and reliability for analysis.
• Data Transformation: This involves converting data into a format suitable
for analysis. It may include scaling, normalization, or encoding data to
make it consistent and comparable.
• Feature Engineering: This involves creating new features or variables
from existing data to improve the accuracy and effectiveness of analysis.
• Data Reduction: This involves reducing the size of the dataset by
removing redundant or irrelevant data, making analysis more efficient.
How Big Data Analytics works…
iii. Data Analysis Techniques
Choosing the most appropriate analytics model depends on the nature of
the problem, the availability of data, and the desired outcomes.

Overview of Analytics Models:


• Descriptive Analytics
• Diagnostic Analytics
• Predictive Analytics
• Prescriptive Analytics
Big Data Visualization
Big Data Visualization
➔ Visualization is the graphical representation of information and data. By
using visual elements like charts, graphs, and maps, visualization tools
provide an accessible way to see and understand trends, outliers, and
patterns in data.

● Types of Visualization:
○ Static Visualization: Fixed visuals such as printed charts and graphs that do not change in
response to user input.

○ Dynamic Visualization: Interactive visuals that allow users to engage with the data (e.g.,
dashboards, interactive charts).

○ 3D Visualization: Visual representations that incorporate three dimensions, often used in scientific fields (e.g., molecular structures).
Big Data Visualization…
○ Data Representation: Data visualization transforms data into visual
representations, making it easier to understand and interpret.

○ Pattern Discovery: Visualizations help identify trends, patterns, and anomalies in data that might be difficult to discern through raw data alone.

○ Communication: Visualizations effectively communicate insights and findings to a wider audience, facilitating understanding and decision-making.
Existing Visualization Techniques
Data Mining Techniques
● These techniques focus on extracting meaningful patterns and insights from
large datasets. Examples include clustering, classification, and association
rule mining.
Encoding Techniques
● These techniques involve mapping data attributes to visual elements, such
as color, size, shape, and position. Effective encoding helps convey
information clearly and efficiently.
Layout Techniques
● These techniques focus on arranging visual elements in a way that enhances
readability and understanding. Examples include scatter plots, bar charts,
and network diagrams.
Techniques for Big Data Visualization
Sampling
● This technique involves selecting a representative subset of the data to
visualize, reducing the volume of data while preserving key insights.
Aggregation
● This technique combines data points into groups or summaries, simplifying
the visualization and highlighting overall trends.
Dimensionality Reduction
● This technique reduces the number of variables or dimensions in the data,
making it easier to visualize and analyze complex datasets.
Interactive Visualization
● This technique allows users to explore and interact with data visualizations,
providing dynamic insights and enabling deeper analysis.
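
A small matplotlib sketch of two of these techniques, sampling and aggregation, on synthetic data (the file names and parameters are arbitrary):

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
df = pd.DataFrame({"x": rng.normal(size=1_000_000),
                   "y": rng.normal(size=1_000_000)})

# Sampling: plot a 1% subset instead of a million points
sample = df.sample(frac=0.01, random_state=1)
plt.scatter(sample["x"], sample["y"], s=2, alpha=0.3)
plt.title("1% random sample")
plt.savefig("sample_scatter.png")

# Aggregation: a 2D histogram summarizes every point without overplotting
plt.figure()
plt.hist2d(df["x"], df["y"], bins=50)
plt.title("Aggregated view (2D histogram)")
plt.savefig("aggregated_hist.png")
```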
Big Data Visualization Best Practices
Know Your Audience
● Tailor visualizations to the specific needs and understanding of your intended audience. Consider
their background, expertise, and the purpose of the visualization.
Choose the Right Chart Type
● Select a chart type that effectively communicates the data and insights you want to convey. Avoid
using complex or misleading charts.
Use Clear and Concise Labels
● Ensure that all axes, legends, and data points are clearly labeled and easy to understand. Avoid
using jargon or overly technical language.
Emphasize Key Insights
● Highlight the most important findings and trends in your visualizations using color, size, or other
visual cues. Avoid overwhelming the audience with too much information.
Provide Context and Narrative
● Supplement visualizations with text or annotations that provide context and explain the story behind
the data. This helps the audience understand the meaning and implications of the visualizations.
Big Data Visualization Challenges
Data Volume
● Visualizing massive datasets requires specialized tools and techniques to handle the
sheer volume of data and avoid performance bottlenecks.
Data Complexity
● Big data often involves complex relationships and structures, making it challenging to
create visualizations that effectively convey the underlying patterns and insights.
Data Velocity
● Real-time or near real-time data streams require dynamic visualization techniques that
can adapt to changing data patterns and provide timely insights.
Data Variety
● Big data often includes diverse data types, such as text, images, and sensor data,
requiring visualization techniques that can handle heterogeneous data sources.
