Data Formats and Machine Learning Methods

The document provides an overview of data formats, categorizing them into text, binary, structured, and unstructured types, with examples and pros/cons for each. It also discusses the Naive Bayes algorithm in healthcare, detailing its applications, advantages, limitations, and best practices for implementation. Additionally, it compares criterion tables with regression models, highlighting their clinical advantages and when to use each approach.


• Slide 2: Introduction to Data Formats
• Definition: Structured ways to store and transmit data.
• Why they matter:
– Enable interoperability.
– Optimize storage/processing.
• Analogy: Like different languages for data.
• Slide 3: Categories of Data Formats
• Visual: Venn diagram (Text vs. Binary vs.
Structured vs. Unstructured)
Examples:
• Text: CSV, JSON, XML
• Binary: Protobuf, Parquet
• Structured: SQL tables
• Unstructured: Images, videos
• Slide 4: Text-Based Formats
• Content:
• Human-readable, lightweight.
• Examples:
– CSV (simple tables)
– JSON (APIs, configs)
– XML (legacy systems)
Pros/Cons Table:
| Format | Pros | Cons |
|--------|------|------|
| CSV | Simple | No schema |
| JSON | Flexible | No comments |
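A quick sketch of the trade-offs in the table above, writing the same record as CSV and JSON using only the Python standard library (the record fields are made up for illustration):

import csv
import io
import json

record = {"name": "John", "age": 30}

# CSV: compact rows, but no schema; everything round-trips as strings
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["name", "age"])
writer.writeheader()
writer.writerow(record)
print(buf.getvalue())      # name,age / John,30

# JSON: self-describing keys and native types, but no comments allowed
print(json.dumps(record))  # {"name": "John", "age": 30}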
• Slide 5: Binary Formats
• Content:
• Machine-optimized, compact.
• Examples:
– Protocol Buffers (Google’s high-speed format)
– Parquet (columnar storage for analytics)
Use Case:
• Protobuf in microservices.
• Slide 6: Structured vs. Unstructured
• Comparison Table:
| | Structured | Unstructured |
|----------|------------|--------------|
| Examples | SQL, CSV | Emails, videos |
| Analysis | Easy to query | Requires AI/ML |
• Image: Example of a database table vs. a social media post.
• Slide 7: Popular Data Formats
• Visual: Icons of CSV, JSON, XML, Parquet,
Protobuf
Key Points:
• CSV: Spreadsheets, small datasets.
• JSON: Web APIs, NoSQL.
• Parquet: Big Data analytics.
• Slide 8: JSON Deep Dive
• Syntax Example:
{ "name": "John", "age": 30 }
• Pros:
• Lightweight, easy to parse.
Cons:
• No schema enforcement.
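To illustrate the "no schema enforcement" con with a minimal standard-library sketch: json.loads accepts structurally valid JSON regardless of field types, so validation is the application's job.

import json

raw = '{ "name": "John", "age": "thirty" }'  # age is a string, not a number
person = json.loads(raw)  # parses without complaint: JSON carries no schema

# Type checking must happen in application code (or via an external schema tool)
if not isinstance(person.get("age"), int):
    print("Invalid record: 'age' must be an integer")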
• Slide 9: XML Deep Dive
• Syntax Example:
<person> <name>John</name> <age>30</age> </person>
• Pros:
• Extensible, supports metadata.
Cons:
• Verbose.
• Slide 10: Protocol Buffers (Protobuf)
• How It Works:
• Define schema in .proto file.
• Compile to binary.
Use Case: gRPC APIs.
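A minimal sketch of this workflow, assuming a hypothetical person.proto (reproduced in the comment below) has already been compiled with protoc into a Python module person_pb2; the schema and module name are assumptions for illustration.

# Assumed schema, compiled beforehand with: protoc --python_out=. person.proto
#   syntax = "proto3";
#   message Person {
#     string name = 1;
#     int32 age = 2;
#   }
import person_pb2  # hypothetical module generated by protoc

# Populate a message and serialize it to a compact binary payload
person = person_pb2.Person(name="John", age=30)
payload = person.SerializeToString()  # bytes, far smaller than JSON/XML text

# Deserialize on the receiving side (e.g., in a gRPC service)
decoded = person_pb2.Person.FromString(payload)
print(decoded.name, decoded.age)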
• Slide 11: Columnar vs. Row-Based
• Visual: Parquet (columnar) vs. CSV (row-based)
Why Columnar?
• Faster queries for analytics.
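A small sketch of the columnar advantage, assuming pandas with a Parquet engine (pyarrow) installed; the point is that Parquet lets a query read only the columns it needs.

import pandas as pd

df = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "age": [34, 51, 29],
    "cholesterol": [180, 240, 199],
})

# Row-based: CSV stores whole rows, so every query scans every column
df.to_csv("patients.csv", index=False)

# Columnar: Parquet stores each column contiguously
df.to_parquet("patients.parquet")
ages = pd.read_parquet("patients.parquet", columns=["age"])  # reads one column
print(ages)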
• Slide 12: Choosing the Right Format
• Decision Flowchart:
• Need human-readable? → JSON/XML.
• Need speed? → Protobuf.
• Big Data? → Parquet.
• Slide 13: Future Trends
• Arrow: In-memory columnar format.
• Edge Computing: Compact binary formats.
• Slide 14: Case Study
• Example:
• Netflix uses Avro for data pipelines.
Naive Bayes algorithm
• Overview of Naive Bayes in Healthcare
• Naive Bayes is a classification algorithm based on
Bayes' Theorem with an assumption of independence
among predictors. In healthcare applications, it can:
• Predict disease likelihood based on symptoms and
patient history
• Assist in diagnosis
• Identify high-risk patients
• Classify medical images
• Predict treatment outcomes
• How It Works in Healthcare Context
• Bayes' Theorem Foundation:
P(A|B) = [P(B|A) * P(A)] / P(B)
• Where:
– A = Disease/condition
– B = Symptoms/test results
• "Naive" Assumption: All features (symptoms,
test results) are conditionally independent
given the class (diagnosis)
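To make the theorem concrete, a worked sketch with illustrative numbers (the prevalence, sensitivity, and false-positive rate below are made up, not real clinical values):

# A = disease, B = positive test result
p_disease = 0.01              # P(A): assumed prevalence of 1%
p_pos_given_disease = 0.90    # P(B|A): assumed test sensitivity
p_pos_given_healthy = 0.05    # assumed false-positive rate

# Total probability of a positive test, P(B)
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior: P(A|B) = P(B|A) * P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")  # ~0.154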
• Common Healthcare Applications
• 1. Disease Prediction
• Diabetes risk assessment based on BMI, age, family history, etc.
• Cardiovascular disease prediction
• 2. Diagnostic Support
• Differentiating between similar conditions (e.g., types of cancer)
• Interpreting lab results
• 3. Medical Text Analysis
• Classifying clinical notes
• Extracting information from EHRs
• 4. Hospital Operations
• Predicting readmission risk
• Length of stay estimation
# Example Python implementation using scikit-learn
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Load healthcare dataset (e.g., patient features and diagnosis);
# "patient_records.csv" is a placeholder file name
healthcare_data = pd.read_csv("patient_records.csv")

X = healthcare_data.drop('diagnosis', axis=1)  # Features: symptoms, tests, demographics
y = healthcare_data['diagnosis']               # Target: disease classification

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train the Gaussian Naive Bayes classifier
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)

# Make predictions on the held-out test set
predictions = nb_classifier.predict(X_test)

# Evaluate performance
print("Accuracy:", accuracy_score(y_test, predictions))
print("Confusion Matrix:\n", confusion_matrix(y_test, predictions))
• Advantages for Healthcare Analytics
• Handles Missing Data: Works well with incomplete
medical records
• Computationally Efficient: Important for large-scale
medical data
• Interpretable Results: Provides probabilistic outputs
clinicians can understand
• Works with Small Datasets: Valuable for rare diseases
• Handles Both Continuous and Categorical Data: Fits
diverse medical data types
• Limitations and Considerations
• Feature Independence Assumption: Medical
symptoms often correlate
• Zero Frequency Problem: Rare symptoms/diseases may need smoothing (see the sketch after this list)
• Feature Importance: All features treated
equally unless weighted
• Data Quality Dependency: Requires clean,
representative medical data
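A minimal sketch of the smoothing fix for the zero-frequency problem, using scikit-learn's MultinomialNB on made-up symptom-indicator data; alpha=1.0 is the classic Laplace setting.

import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy symptom indicators (1 = present); values are invented for illustration.
# The last symptom never occurs with class 0 in training, so without
# smoothing its class-0 likelihood would be exactly zero.
X = np.array([[1, 0, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 0, 1],
              [0, 1, 1, 1]])
y = np.array([0, 0, 1, 1])

# alpha=1.0 adds one pseudo-count per feature (Laplace smoothing), giving
# unseen symptom/class combinations a small nonzero probability
model = MultinomialNB(alpha=1.0)
model.fit(X, y)
print(model.predict_proba([[1, 0, 1, 1]]))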
• Best Practices for Healthcare Implementation
• Feature Selection: Choose clinically relevant predictors
• Data Preprocessing:
– Handle missing values appropriately
– Normalize continuous variables (for Gaussian Naive Bayes)
– Discretize continuous variables when needed
• Model Evaluation:
– Use medical-specific metrics beyond accuracy (sensitivity, specificity); see the sketch after this list
– Validate with clinical experts
• Explainability:
– Provide probability estimates to clinicians
– Highlight contributing factors to predictions
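A short sketch of the sensitivity/specificity computation mentioned under Model Evaluation, continuing from the confusion matrix in the earlier code example and assuming a binary diagnosis label:

from sklearn.metrics import confusion_matrix

# For a binary outcome, ravel() unpacks the 2x2 matrix in this order
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()

sensitivity = tp / (tp + fn)  # recall on the disease class: cases caught
specificity = tn / (tn + fp)  # healthy patients correctly ruled out
print(f"Sensitivity: {sensitivity:.2f}, Specificity: {specificity:.2f}")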
• Real-World Healthcare Examples
• Cancer Classification: Differentiating tumor
types based on genomic data
• COVID-19 Risk Prediction: Assessing
hospitalization risk from early symptoms
• Mental Health Screening: Identifying
depression risk from patient questionnaires
• Adverse Drug Reaction Prediction: Flagging
potential medication issues
• The Criterion Table Approach
• A criterion table (or decision table) is a
structured framework that:
• Lists relevant clinical factors (predictors)
• Assigns weights or scores to each factor
• Provides decision thresholds based on
accumulated scores
• Outputs diagnostic or prognostic classifications
• Example: Pneumonia Severity Index (PSI)
• Age >50 years: +1 point
• Male sex: +10 points
• Cancer history: +30 points
• Altered mental status: +20 points
• ...
• Total Score → Risk Class I-V
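A minimal sketch of how such a table computes, using the simplified point values from this slide (the real PSI uses many more factors, and the risk-class cutoffs below are illustrative assumptions, not the published boundaries):

def psi_score(age, male, cancer_history, altered_mental_status):
    """Simplified PSI-style score using the point values above."""
    score = 0
    if age > 50:
        score += 1
    if male:
        score += 10
    if cancer_history:
        score += 30
    if altered_mental_status:
        score += 20
    return score

def risk_class(score):
    # Illustrative cutoffs only; the published PSI boundaries differ
    for limit, cls in [(10, "I"), (20, "II"), (30, "III"), (50, "IV")]:
        if score <= limit:
            return cls
    return "V"

s = psi_score(age=67, male=True, cancer_history=False, altered_mental_status=True)
print(s, risk_class(s))  # 31 -> class IV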
• Similarities to Regression Models
• Shared Characteristics with Linear Regression
• Additive Structure: Both combine weighted predictors
linearly
– Criterion table: Sum(scores)
– Linear regression: β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ
• Continuous Output Potential: Some criterion tables
produce continuous risk scores similar to linear regression
outputs
• Feature Weighting: Both methods assign different
importance to predictors
• Shared Characteristics with Logistic Regression
• Classification Focus: Both often used for
binary/multiclass outcomes (disease/no disease)
• Threshold-Based Decisions:
– Logistic regression uses probability thresholds (typically
0.5)
– Criterion tables use predefined score cutoffs
• Probabilistic Interpretation: Advanced criterion
tables may provide risk probabilities like logistic
regression
| Characteristic | Criterion Table | Regression Models |
|----------------|-----------------|-------------------|
| Development | Often expert-driven | Data-driven |
| Flexibility | Fixed structure | Adapts to data patterns |
| Interactions | Rarely accounts for them | Can model interactions |
| Implementation | Simple paper/electronic form | Requires software |
| Updates | Manual revision needed | Retrain with new data |
| Interpretability | Highly transparent | Requires statistical literacy |
• Clinical Advantages of Criterion Tables
• Practical Implementation: Can be used at
bedside without computers
• Cognitive Fit: Matches physicians' heuristic
reasoning
• Transparency: Clear scoring system builds
clinician trust
• Regulatory Acceptance: Many are guideline-
endorsed (e.g., CHA₂DS₂-VASc for stroke risk)
• When to Use Each Approach
• Use Criterion Tables When:
• Decision rules need to be implementable in resource-limited
settings
• Clinical expertise is more reliable than available data
• Speed and simplicity are prioritized over optimal accuracy
• Use Regression Models When:
• Large, high-quality datasets are available
• Complex predictor interactions exist
• Continuous probability estimates are needed
• The clinical environment supports digital decision tools
• Hybrid Approaches in Modern Medicine
• Many contemporary clinical decision tools combine
strengths of both:
• Data-derived criterion tables: Using regression
coefficients to inform point assignments
• Electronic implementations: Embedding regression
models behind user-friendly interfaces
• Machine learning hybrids: Using criterion tables as
interpretable components of more complex models
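A minimal sketch of the first hybrid idea, turning (assumed) logistic regression coefficients into integer points by scaling against the smallest effect and rounding; the coefficient values are illustrative, not from a fitted model.

# Convert assumed logistic regression coefficients into criterion-table points
coefficients = {  # illustrative values only
    "age_over_50": 0.15,
    "male_sex": 0.45,
    "cancer_history": 1.20,
    "altered_mental_status": 0.90,
}

base = min(coefficients.values())  # smallest effect maps to ~1 point
points = {name: round(coef / base) for name, coef in coefficients.items()}
print(points)  # {'age_over_50': 1, 'male_sex': 3, 'cancer_history': 8, ...}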
