• Slide 2: Introduction to Data Formats
• Content:
• Definition: Structured ways to store and transmit data.
• Why they matter:
– Enable interoperability.
– Optimize storage/processing.
• Analogy: Like different languages for data.
• Slide 3: Categories of Data Formats
• Visual: Venn diagram (Text vs. Binary vs.
Structured vs. Unstructured)
Examples:
• Text: CSV, JSON, XML
• Binary: Protobuf, Parquet
• Structured: SQL tables
• Unstructured: Images, videos
• Slide 4: Text-Based Formats
• Content:
• Human-readable, lightweight.
• Examples:
– CSV (simple tables)
– JSON (APIs, configs)
– XML (legacy systems)
Pros/Cons Table:
| Format | Pros | Cons |
|--------|------|------|
| CSV | Simple | No schema |
| JSON | Flexible | No comments |
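The "No schema" and "Flexible" entries above can be made concrete with a small sketch using only Python's standard library: CSV hands every parsed value back as a string, while JSON preserves basic types.

```python
import csv
import io
import json

# CSV has no schema: every parsed value comes back as a string.
csv_text = "name,age\nJohn,30\n"
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(rows[0])  # {'name': 'John', 'age': '30'}  <- note: age is a str

# JSON distinguishes strings, numbers, booleans, and nulls.
record = json.loads('{"name": "John", "age": 30}')
print(type(record["age"]))  # <class 'int'>
```

This is why CSV consumers typically need out-of-band type information, whereas JSON carries at least primitive types in-band.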
• Slide 5: Binary Formats
• Content:
• Machine-optimized, compact.
• Examples:
– Protocol Buffers (Google’s high-speed format)
– Parquet (columnar storage for analytics)
Use Case:
• Protobuf in microservices.
• Slide 6: Structured vs. Unstructured
• Comparison Table:

| | Structured | Unstructured |
|----------|---------------|----------------|
| Examples | SQL, CSV | Emails, videos |
| Analysis | Easy to query | Requires AI/ML |

• Image: Example of a database table vs. a social media post.
• Slide 7: Popular Data Formats
• Visual: Icons of CSV, JSON, XML, Parquet,
Protobuf
Key Points:
• CSV: Spreadsheets, small datasets.
• JSON: Web APIs, NoSQL.
• Parquet: Big Data analytics.
• Slide 8: JSON Deep Dive
• Syntax Example:

```json
{ "name": "John", "age": 30 }
```
• Pros:
• Lightweight, easy to parse.
Cons:
• No schema enforcement.
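Both the pro (easy to parse) and the con (no schema enforcement) can be demonstrated with Python's standard `json` module:

```python
import json

# Parsing the slide's example record takes one call.
person = json.loads('{ "name": "John", "age": 30 }')
print(person["age"])  # 30

# "No schema enforcement": a record with the wrong type for a field
# parses just as happily -- validation is the application's job.
loose = json.loads('{ "name": "John", "age": "thirty" }')
print(loose["age"])  # thirty

# Serializing back to text is a round trip.
print(json.dumps(person))
```

Schema validation, when needed, is layered on top (e.g., with JSON Schema) rather than built into the format.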
• Slide 9: XML Deep Dive
• Syntax Example:
```xml
<person> <name>John</name> <age>30</age> </person>
```
• Pros:
• Extensible, supports metadata.
Cons:
• Verbose.
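The same record can be parsed with Python's standard `xml.etree` module; note that, unlike JSON, every XML value arrives as text and must be converted by hand:

```python
import xml.etree.ElementTree as ET

xml_text = "<person> <name>John</name> <age>30</age> </person>"
root = ET.fromstring(xml_text)

# Element text is always a string; numeric conversion is manual.
name = root.findtext("name")
age = int(root.findtext("age"))
print(name, age)  # John 30
```

The paired open/close tags are also what make XML verbose relative to JSON for the same data.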
• Slide 10: Protocol Buffers (Protobuf)
• How It Works:
• Define schema in .proto file.
• Compile to binary.
Use Case: gRPC APIs.
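The two steps above (define a schema, compile to binary) can be sketched with a hypothetical `.proto` file matching the JSON/XML examples; the message and field names here are illustrative, not from any real service:

```proto
// person.proto -- a hypothetical proto3 schema.
// protoc compiles this into generated serializer code (e.g.
// `protoc --python_out=. person.proto`); on the wire, fields
// travel as compact binary tag/value pairs, not as text.
syntax = "proto3";

message Person {
  string name = 1;  // field number 1 identifies this field on the wire
  int32  age  = 2;  // field number 2
}
```

Because the wire format carries field numbers instead of field names, Protobuf messages are much smaller than equivalent JSON or XML.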
• Slide 11: Columnar vs. Row-Based
• Visual: Parquet (columnar) vs. CSV (row-
based)
Why Columnar?
• Faster queries for analytics.
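The row-vs.-column distinction can be sketched with plain Python lists (no Parquet library involved, just the layout idea):

```python
# Row-based layout (CSV-like): each record is stored together, so an
# aggregate over one column still touches every field of every row.
rows = [
    {"user": "a", "clicks": 3, "country": "US"},
    {"user": "b", "clicks": 5, "country": "DE"},
    {"user": "c", "clicks": 2, "country": "US"},
]
total_rows = sum(r["clicks"] for r in rows)

# Columnar layout (Parquet-like): each column is stored contiguously,
# so an analytic query reads only the one column it needs.
columns = {
    "user": ["a", "b", "c"],
    "clicks": [3, 5, 2],
    "country": ["US", "DE", "US"],
}
total_cols = sum(columns["clicks"])
print(total_rows, total_cols)  # 10 10
```

Columnar storage also compresses better, since values of one type sit next to each other.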
• Slide 12: Choosing the Right Format
• Decision Flowchart:
• Need human-readable? → JSON/XML.
• Need speed? → Protobuf.
• Big Data? → Parquet.
• Slide 13: Future Trends
• Arrow: In-memory columnar format.
• Edge Computing: Compact binary formats.
• Slide 14: Case Study
• Example:
• Netflix uses Avro for data pipelines.
Naive Bayes algorithm
• Overview of Naive Bayes in Healthcare
• Naive Bayes is a classification algorithm based on
Bayes' Theorem with an assumption of independence
among predictors. In healthcare applications, it can:
• Predict disease likelihood based on symptoms and
patient history
• Assist in diagnosis
• Identify high-risk patients
• Classify medical images
• Predict treatment outcomes
• How It Works in Healthcare Context
• Bayes' Theorem Foundation:
P(A|B) = [P(B|A) * P(A)] / P(B)
• Where:
– A = Disease/condition
– B = Symptoms/test results
• "Naive" Assumption: All features (symptoms,
test results) are conditionally independent
given the class (diagnosis)
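The theorem above can be worked through with hypothetical numbers (a 1% prevalence, a 90%-sensitive test, a 5% false-positive rate; none of these figures come from the slides):

```python
# Worked example of P(A|B) = P(B|A) * P(A) / P(B), where
# A = disease and B = positive test result.
p_disease = 0.01             # prior P(A): 1% prevalence (assumed)
p_pos_given_disease = 0.90   # P(B|A): test sensitivity (assumed)
p_pos_given_healthy = 0.05   # false-positive rate (assumed)

# Total probability of a positive test, P(B), by the law of total probability:
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(A|B): probability of disease given a positive test.
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))  # 0.154
```

Even with a sensitive test, the low prior keeps the posterior modest; this base-rate effect is exactly what Bayes' Theorem captures for clinicians.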
• Common Healthcare Applications
• 1. Disease Prediction
• Diabetes risk assessment based on BMI, age, family history, etc.
• Cardiovascular disease prediction
• 2. Diagnostic Support
• Differentiating between similar conditions (e.g., types of cancer)
• Interpreting lab results
• 3. Medical Text Analysis
• Classifying clinical notes
• Extracting information from EHRs
• 4. Hospital Operations
• Predicting readmission risk
• Length of stay estimation
```python
# Example Python implementation using scikit-learn
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# Load healthcare dataset (e.g., patient features and diagnosis).
# healthcare_data is assumed to be an already-loaded pandas DataFrame,
# e.g. healthcare_data = pd.read_csv(...)
X = healthcare_data.drop('diagnosis', axis=1)  # Features: symptoms, tests, demographics
y = healthcare_data['diagnosis']               # Target: disease classification

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize and train Naive Bayes classifier
nb_classifier = GaussianNB()
nb_classifier.fit(X_train, y_train)

# Make predictions
predictions = nb_classifier.predict(X_test)

# Evaluate performance
print("Accuracy:", accuracy_score(y_test, predictions))
print("Confusion Matrix:\n", confusion_matrix(y_test, predictions))
```
• Advantages for Healthcare Analytics
• Handles Missing Data: Works well with incomplete
medical records
• Computationally Efficient: Important for large-scale
medical data
• Interpretable Results: Provides probabilistic outputs
clinicians can understand
• Works with Small Datasets: Valuable for rare diseases
• Handles Both Continuous and Categorical Data: Fits
diverse medical data types
• Limitations and Considerations
• Feature Independence Assumption: Medical
symptoms often correlate
• Zero Frequency Problem: Rare
symptoms/diseases may need smoothing
• Feature Importance: All features treated
equally unless weighted
• Data Quality Dependency: Requires clean,
representative medical data
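The zero-frequency problem above has a standard fix, Laplace (add-one) smoothing, sketched here with hypothetical symptom counts:

```python
# Zero-frequency problem: a symptom never observed with a disease in the
# training data gets P(symptom | disease) = 0, which zeroes out the whole
# Naive Bayes product for that class.
counts = {"fever": 12, "cough": 8, "rash": 0}  # hypothetical counts in one class
n_class = 20                                   # patients observed in that class
vocab = len(counts)                            # number of distinct symptoms

unsmoothed = counts["rash"] / n_class
print(unsmoothed)  # 0.0 -- kills any product it appears in

# Laplace (add-one) smoothing: pretend every symptom was seen once more.
alpha = 1
smoothed = (counts["rash"] + alpha) / (n_class + alpha * vocab)
print(smoothed)  # 1/23, about 0.043
```

scikit-learn's discrete Naive Bayes variants expose this as the `alpha` parameter.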
• Best Practices for Healthcare Implementation
• Feature Selection: Choose clinically relevant predictors
• Data Preprocessing:
– Handle missing values appropriately
– Normalize continuous variables (for Gaussian Naive Bayes)
– Discretize continuous variables when needed
• Model Evaluation:
– Use medical-specific metrics beyond accuracy (sensitivity, specificity)
– Validate with clinical experts
• Explainability:
– Provide probability estimates to clinicians
– Highlight contributing factors to predictions
• Real-World Healthcare Examples
• Cancer Classification: Differentiating tumor
types based on genomic data
• COVID-19 Risk Prediction: Assessing
hospitalization risk from early symptoms
• Mental Health Screening: Identifying
depression risk from patient questionnaires
• Adverse Drug Reaction Prediction: Flagging
potential medication issues
• The Criterion Table Approach
• A criterion table (or decision table) is a
structured framework that:
• Lists relevant clinical factors (predictors)
• Assigns weights or scores to each factor
• Provides decision thresholds based on
accumulated scores
• Outputs diagnostic or prognostic classifications
• Example: Pneumonia Severity Index (PSI)
• Age >50 years: +1 point
• Male sex: +10 points
• Cancer history: +30 points
• Altered mental status: +20 points
• ...
• Total Score → Risk Class I-V
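A criterion table like the one sketched above is straightforward to implement as code; the weights and cutoffs below reuse the slide's illustrative point values and an assumed score-to-class mapping, not the official index:

```python
# A criterion-table classifier: sum points for the factors a patient
# meets, then map the total score to a risk class via fixed cutoffs.
WEIGHTS = [
    ("age_over_50", 1),
    ("male_sex", 10),
    ("cancer_history", 30),
    ("altered_mental_status", 20),
]

def risk_score(patient: dict) -> int:
    # Sum the points for every criterion the patient satisfies.
    return sum(points for factor, points in WEIGHTS if patient.get(factor))

def risk_class(score: int) -> str:
    # Hypothetical cutoffs mapping the accumulated score to a class.
    if score <= 10:
        return "low"
    if score <= 30:
        return "moderate"
    return "high"

patient = {"age_over_50": True, "cancer_history": True}
score = risk_score(patient)
print(score, risk_class(score))  # 31 high
```

The entire decision logic fits on a card, which is precisely the bedside appeal discussed below.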
• Similarities to Regression Models
• Shared Characteristics with Linear Regression
• Additive Structure: Both combine weighted predictors
linearly
– Criterion table: Sum(scores)
– Linear regression: β₀ + β₁X₁ + β₂X₂ + ... + βₙXₙ
• Continuous Output Potential: Some criterion tables
produce continuous risk scores similar to linear regression
outputs
• Feature Weighting: Both methods assign different
importance to predictors
• Shared Characteristics with Logistic Regression
• Classification Focus: Both often used for
binary/multiclass outcomes (disease/no disease)
• Threshold-Based Decisions:
– Logistic regression uses probability thresholds (typically
0.5)
– Criterion tables use predefined score cutoffs
• Probabilistic Interpretation: Advanced criterion
tables may provide risk probabilities like logistic
regression
| Characteristic | Criterion Table | Regression Models |
|------------------|------------------------------|-----------------------------|
| Development | Often expert-driven | Data-driven |
| Flexibility | Fixed structure | Adapts to data patterns |
| Interactions | Rarely accounts for them | Can model interactions |
| Implementation | Simple paper/electronic form | Requires software |
| Updates | Manual revision needed | Retrain with new data |
| Interpretability | Highly transparent | Requires statistical literacy |
• Clinical Advantages of Criterion Tables
• Practical Implementation: Can be used at
bedside without computers
• Cognitive Fit: Matches physicians' heuristic
reasoning
• Transparency: Clear scoring system builds
clinician trust
• Regulatory Acceptance: Many are guideline-
endorsed (e.g., CHA₂DS₂-VASc for stroke risk)
• When to Use Each Approach
• Use Criterion Tables When:
• Decision rules need to be implementable in resource-limited
settings
• Clinical expertise is more reliable than available data
• Speed and simplicity are prioritized over optimal accuracy
• Use Regression Models When:
• Large, high-quality datasets are available
• Complex predictor interactions exist
• Continuous probability estimates are needed
• The clinical environment supports digital decision tools
• Hybrid Approaches in Modern Medicine
• Many contemporary clinical decision tools combine
strengths of both:
• Data-derived criterion tables: Using regression
coefficients to inform point assignments
• Electronic implementations: Embedding regression
models behind user-friendly interfaces
• Machine learning hybrids: Using criterion tables as
interpretable components of more complex models
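The first hybrid above, deriving point assignments from regression coefficients, can be sketched with one common recipe (scale each coefficient by the smallest one and round to an integer); the predictors and coefficient values here are hypothetical:

```python
# Converting logistic-regression coefficients into integer points for a
# data-derived criterion table. Recipe: one point per 'base' units of
# log-odds, where base is the smallest absolute coefficient.
betas = {"age_over_65": 0.47, "diabetes": 0.31, "prior_stroke": 1.10}

base = min(abs(b) for b in betas.values())
points = {factor: round(b / base) for factor, b in betas.items()}
print(points)  # {'age_over_65': 2, 'diabetes': 1, 'prior_stroke': 4}
```

The rounding sacrifices a little calibration for a scoring rule clinicians can apply without software, which is the trade-off this whole comparison is about.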