Notes on SQL for AI/ML
Part I – SQL Fundamentals for Data Handling
1. What is SQL?
• SQL (Structured Query Language) is used to store, retrieve, and manipulate data in
relational databases.
• Databases are essential in ML/AI for data storage, labeling, and preprocessing before
modeling.
Popular SQL engines: MySQL, PostgreSQL, SQLite, MSSQL, Google BigQuery.
2. Basic SQL Commands
Type Purpose Examples
Define or modify database
DDL (Data Definition Language) CREATE, ALTER, DROP
structure
DML (Data Manipulation INSERT, UPDATE,
Add or modify data
Language) DELETE
DQL (Query Language) Retrieve data SELECT
DCL (Control Language) Control permissions GRANT, REVOKE
3. Data Retrieval
SELECT column1, column2
FROM table_name
WHERE condition
ORDER BY column1 DESC
LIMIT 10;
Example:
SELECT age, salary
FROM employees
WHERE salary > 50000
ORDER BY age ASC;
4. Filtering and Conditions
Operator Meaning
= Equal
<> or != Not equal
Notes on SQL for AI/ML
Operator Meaning
> < >= <= Comparison
AND, OR, NOT Logical
BETWEEN, IN, LIKE Range & pattern filters
Example:
SELECT * FROM sales
WHERE region IN ('Asia', 'Europe') AND revenue BETWEEN 10000 AND 50000;
5. Aggregations
Useful for summarizing or grouping data before feeding into ML models.
SELECT department, AVG(salary), MAX(salary)
FROM employees
GROUP BY department
HAVING AVG(salary) > 40000;
Common Functions: COUNT(), SUM(), AVG(), MAX(), MIN()
6. Joins – Combining Data Sources
Data for ML often comes from multiple tables.
Join Type Description
INNER JOIN Matching rows in both tables
LEFT JOIN All rows from left + matching from right
RIGHT JOIN All from right + matching from left
FULL JOIN All records from both sides
Example:
SELECT e.name, d.department_name
FROM employees e
JOIN departments d ON e.dept_id = d.id;
Part II – SQL for Data Analysis & Feature Engineering
7. Data Cleaning with SQL
Notes on SQL for AI/ML
AI models need clean, consistent data.
Common cleaning operations:
• Handling missing data
SELECT * FROM data WHERE value IS NULL;
UPDATE data SET value = 0 WHERE value IS NULL;
• Removing duplicates
DELETE FROM table
WHERE id NOT IN (
SELECT MIN(id) FROM table GROUP BY column1, column2
);
• Fixing inconsistent text
UPDATE users
SET city = INITCAP(city); -- Standardize case
8. Data Transformation for Features
SQL can create derived variables (features) for ML models.
• Creating new columns
ALTER TABLE customers ADD COLUMN income_level TEXT;
UPDATE customers
SET income_level = CASE
WHEN income > 80000 THEN 'High'
WHEN income BETWEEN 40000 AND 80000 THEN 'Medium'
ELSE 'Low'
END;
• Scaling / normalization (conceptually)
You can compute min/max or z-score values:
SELECT (salary - AVG(salary)) / STDDEV(salary) AS z_score
FROM employees;
• Binning / Categorization
SELECT
CASE
WHEN age < 20 THEN 'Teen'
WHEN age BETWEEN 20 AND 35 THEN 'Adult'
ELSE 'Senior'
END AS age_group,
COUNT(*) AS count
FROM users
Notes on SQL for AI/ML
GROUP BY age_group;
9. Statistical Analysis in SQL
SQL supports basic descriptive statistics directly — useful before exporting to Python/R.
SELECT
AVG(age) AS mean_age,
STDDEV(age) AS std_age,
MIN(age),
MAX(age)
FROM users;
Correlations (approximation):
While not native in SQL, some engines (e.g., PostgreSQL) allow:
SELECT CORR(income, spending_score) FROM customers;
10. Time-Series and Sequential Data
Used in AI for forecasting, behavior tracking, or IoT.
SELECT
DATE(timestamp) AS day,
AVG(sensor_value) AS avg_value
FROM sensors
GROUP BY day
ORDER BY day;
You can also calculate moving averages:
SELECT
timestamp,
AVG(sensor_value)
OVER (ORDER BY timestamp ROWS BETWEEN 3 PRECEDING AND CURRENT ROW) AS
moving_avg
FROM sensors;
11. CTEs (Common Table Expressions)
They simplify complex analysis.
WITH avg_sales AS (
SELECT product_id, AVG(revenue) AS avg_rev
FROM sales
GROUP BY product_id
)
SELECT p.name, a.avg_rev
Notes on SQL for AI/ML
FROM products p
JOIN avg_sales a ON p.id = a.product_id;
This is handy when building multi-step data pipelines for ML.
Part III – SQL in ML Pipelines & Advanced Analytics
12. Preparing Data for ML
SQL can output data directly to CSV or ML tools.
Example pipeline:
1. Collect: from raw tables (sales, users, transactions)
2. Clean: remove duplicates/nulls
3. Engineer: new features (avg_purchase, days_active)
4. Export: to ML tool
SELECT
user_id,
COUNT(order_id) AS total_orders,
AVG(amount) AS avg_order_value,
MAX(order_date) AS last_order_date,
EXTRACT(DAY FROM CURRENT_DATE - MAX(order_date)) AS days_since_last_order
FROM orders
GROUP BY user_id;
13. Integrating SQL with ML Tools
SQL often acts as the data source for:
• Python (pandas, scikit-learn) → pd.read_sql()
• R → dbGetQuery()
• Power BI / Tableau / BigQuery ML
BigQuery ML example:
CREATE MODEL my_model
OPTIONS(model_type='linear_reg') AS
SELECT
age, income, spending_score
FROM customers;
14. Advanced Analytical Queries
Notes on SQL for AI/ML
• Ranking and Window Functions
SELECT
user_id,
SUM(amount) AS total_spent,
RANK() OVER (ORDER BY SUM(amount) DESC) AS rank
FROM orders
GROUP BY user_id;
• Anomaly Detection Prep
SELECT *,
CASE
WHEN amount > (AVG(amount) + 3 * STDDEV(amount)) THEN 'Outlier'
ELSE 'Normal'
END AS anomaly_flag
FROM transactions;
15. SQL for AI Use Cases
Use Case SQL Role
Data preprocessing Clean, join, filter, and aggregate raw data
Feature engineering Create categorical/continuous features
Model training Export structured data to ML framework
Monitoring AI models Track predictions and metrics in SQL tables
Explainability Compare predicted vs actual outcomes
16. Real-World Example: Customer Churn Dataset
SELECT
customer_id,
AVG(call_duration) AS avg_call_time,
COUNT(complaints) AS complaint_count,
CASE WHEN SUM(payment_late) > 0 THEN 1 ELSE 0 END AS payment_flag,
CASE WHEN churned = 'Yes' THEN 1 ELSE 0 END AS churn_label
FROM telecom_data
GROUP BY customer_id;
Then export this dataset to Python for model training (classification task).
17. Key Takeaways
• SQL is the foundation for all ML/AI data preparation.
• Use SQL for:
Notes on SQL for AI/ML
o Cleaning, joining, and aggregating data
o Feature creation and basic statistics
o Time-series and anomaly tracking
• Combine with Python/R for actual model training