Cleaning data in SQL involves identifying and correcting errors, inconsistencies, or incomplete data in your database. Here are some common techniques to clean data effectively:
1. Remove Duplicate Records
Use DISTINCT to return unique rows in a query, or GROUP BY or ROW_NUMBER() to identify duplicate rows and delete them.
-- Example: Remove duplicates based on specific columns, keeping the row with the lowest id
DELETE FROM your_table
WHERE id NOT IN (
SELECT MIN(id)
FROM your_table
GROUP BY column1, column2
);
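To try the statement above on a throwaway table, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The `customers` table, its columns, and the sample rows are hypothetical and exist only for illustration.

```python
import sqlite3

# In-memory database; the table name, columns, and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO customers (name, email) VALUES (?, ?)",
    [
        ("Ann", "ann@example.com"),
        ("Ann", "ann@example.com"),  # duplicate of the first row
        ("Bob", "bob@example.com"),
        ("Ann", "ann@example.com"),  # another duplicate of the first row
    ],
)

# Keep the lowest id in each (name, email) group and delete every other row.
conn.execute(
    """
    DELETE FROM customers
    WHERE id NOT IN (
        SELECT MIN(id)
        FROM customers
        GROUP BY name, email
    )
    """
)

remaining = conn.execute(
    "SELECT id, name, email FROM customers ORDER BY id"
).fetchall()
print(remaining)
```

Only one row per (name, email) pair should remain. Some engines, MySQL among them, reject a DELETE whose subquery reads the same table, so the statement may need restructuring there.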
2. Handle Missing or Null Values
Replace NULL values with default values or meaningful substitutes using COALESCE() or CASE.
-- Example: Replace NULL with a default value
UPDATE your_table
SET column_name = COALESCE(column_name, 'Default Value')
WHERE column_name IS NULL; -- touch only the rows that need the default
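The two functions named above can be compared side by side. This is a minimal sketch using Python's built-in sqlite3 module; the `orders` table, its columns, and the substitute values ('unknown' and 0) are assumptions chosen for illustration.

```python
import sqlite3

# In-memory database; table, columns, and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, quantity INTEGER)")
conn.executemany(
    "INSERT INTO orders (status, quantity) VALUES (?, ?)",
    [("shipped", 2), (None, 5), ("pending", None)],
)

# COALESCE returns its first non-NULL argument, so only NULL statuses change.
conn.execute("UPDATE orders SET status = COALESCE(status, 'unknown')")

# CASE expresses the same idea as an explicit condition: a NULL quantity becomes 0.
conn.execute(
    "UPDATE orders SET quantity = CASE WHEN quantity IS NULL THEN 0 ELSE quantity END"
)

rows = conn.execute("SELECT id, status, quantity FROM orders ORDER BY id").fetchall()
print(rows)
```

COALESCE is the shorter form for a plain default. CASE is more useful when the substitute depends on other columns or conditions.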
3. Standardize Data Formats
Use functions like UPPER(), LOWER(), TRIM(), or FORMAT() to ensure consistency in text, dates, or numbers.
-- Example: Standardize text to uppercase
UPDATE your_table
SET column_name = UPPER(column_name);
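Text functions are often combined in one pass. The sketch below, run through Python's built-in sqlite3 module, trims whitespace and upper-cases in a single UPDATE. The `countries` table and its sample codes are hypothetical.

```python
import sqlite3

# In-memory database; table, column, and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE countries (id INTEGER PRIMARY KEY, code TEXT)")
conn.executemany(
    "INSERT INTO countries (code) VALUES (?)",
    [("  us",), ("Us ",), ("DE",)],
)

# TRIM strips leading and trailing spaces, then UPPER normalizes the case.
conn.execute("UPDATE countries SET code = UPPER(TRIM(code))")

codes = [row[0] for row in conn.execute("SELECT code FROM countries ORDER BY id")]
print(codes)
```

SQLite's built-in UPPER() and LOWER() only fold ASCII letters by default, so non-ASCII text may need different handling in that engine.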
4. Remove Unwanted Characters
Use REPLACE() or REGEXP_REPLACE() to clean up unwanted characters.
-- Example: Remove hyphens (nest REPLACE() calls to strip several different characters)
UPDATE your_table
SET column_name = REPLACE(column_name, '-', '');
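When several characters have to go, REPLACE() calls can be nested. The sketch below uses Python's built-in sqlite3 module and a hypothetical `contacts` table with made-up phone numbers. It relies on nested REPLACE() because SQLite does not ship REGEXP_REPLACE() by default.

```python
import sqlite3

# In-memory database; table, column, and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, phone TEXT)")
conn.executemany(
    "INSERT INTO contacts (phone) VALUES (?)",
    [("(555) 123-4567",), ("555-987-6543",)],
)

# Each REPLACE removes one character; the innermost call runs first.
conn.execute(
    """
    UPDATE contacts
    SET phone = REPLACE(REPLACE(REPLACE(REPLACE(phone, '(', ''), ')', ''), ' ', ''), '-', '')
    """
)

phones = [row[0] for row in conn.execute("SELECT phone FROM contacts ORDER BY id")]
print(phones)
```

On engines that provide REGEXP_REPLACE(), a single pattern can usually replace the nested calls.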
5. Validate Data Integrity
Use constraints or queries to identify invalid data (e.g., out-of-range values).
-- Example: Find out-of-range values (rows where column_name is NULL are not returned)
SELECT *
FROM your_table
WHERE column_name NOT BETWEEN 1 AND 100;
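The query above finds bad rows after the fact. A constraint can stop them from being stored in the first place. This is a minimal sketch using Python's built-in sqlite3 module; the `ratings` table and the 1 to 100 range are assumptions that mirror the example.

```python
import sqlite3

# In-memory database; the table and its CHECK range are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE ratings (
        id INTEGER PRIMARY KEY,
        score INTEGER CHECK (score BETWEEN 1 AND 100)
    )
    """
)

conn.execute("INSERT INTO ratings (score) VALUES (50)")  # within range, accepted

# An out-of-range value violates the CHECK constraint and raises IntegrityError.
try:
    conn.execute("INSERT INTO ratings (score) VALUES (150)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stored = [row[0] for row in conn.execute("SELECT score FROM ratings")]
print(rejected, stored)
```

A CHECK constraint passes when the expression evaluates to NULL, so add NOT NULL as well if missing values should also be rejected.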
6. Deduplicate with CTEs
Use Common Table Expressions (CTEs) to identify and delete duplicates.
WITH CTE AS (
SELECT id, column1, column2, ROW_NUMBER() OVER (PARTITION BY column1, column2 ORDER BY id) AS row_num
FROM your_table
)
DELETE FROM your_table
WHERE id IN (SELECT id FROM CTE WHERE row_num > 1);
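The CTE approach can be exercised the same way. The sketch below uses Python's built-in sqlite3 module and assumes the bundled SQLite is version 3.25 or later, which is when window functions such as ROW_NUMBER() were added. The `events` table and its rows are hypothetical.

```python
import sqlite3

# In-memory database; table, columns, and rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, action TEXT)")
conn.executemany(
    "INSERT INTO events (user_id, action) VALUES (?, ?)",
    [(1, "login"), (1, "login"), (2, "login"), (1, "logout"), (2, "login")],
)

# Number the rows inside each (user_id, action) group by ascending id.
# Rows numbered 2 or higher are duplicates of the first row in their group.
conn.execute(
    """
    WITH ranked AS (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY user_id, action ORDER BY id) AS row_num
        FROM events
    )
    DELETE FROM events
    WHERE id IN (SELECT id FROM ranked WHERE row_num > 1)
    """
)

kept_ids = [row[0] for row in conn.execute("SELECT id FROM events ORDER BY id")]
print(kept_ids)
```

The CTE has to select `id` because the outer DELETE filters on it. Changing the ORDER BY inside the window decides which row in each group survives.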
7. Remove Outliers
Identify and remove outliers using fixed bounds or statistical thresholds (for example, a number of standard deviations from the mean).
-- Example: Remove outliers outside fixed bounds (0 to 1000 here)
DELETE FROM your_table
WHERE column_name > 1000 OR column_name < 0;
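A threshold can also be derived from the data itself. The sketch below is one possible approach, not a recommendation for any particular dataset: it removes values more than k standard deviations from the mean. It uses Python's built-in sqlite3 module; the `readings` table, the sample values, and k = 2 are assumptions.

```python
import sqlite3

# In-memory database; table, column, and values are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany(
    "INSERT INTO readings (value) VALUES (?)",
    [(10.0,), (12.0,), (11.0,), (13.0,), (9.0,), (200.0,)],
)

# Population mean and variance, computed in SQL as E[x^2] - (E[x])^2.
mean, variance = conn.execute(
    "SELECT AVG(value), AVG(value * value) - AVG(value) * AVG(value) FROM readings"
).fetchone()

k = 2.0  # assumed cutoff, in standard deviations

# Compare squared distances, (x - mean)^2 > k^2 * variance, so no SQRT() is needed.
conn.execute(
    "DELETE FROM readings WHERE (value - ?) * (value - ?) > ? * ?",
    (mean, mean, k * k, variance),
)

kept = [row[0] for row in conn.execute("SELECT value FROM readings ORDER BY value")]
print(kept)
```

The statistics are computed once, before the DELETE, so removing rows does not shift the threshold partway through. Extreme values inflate the mean and variance they are measured against, so on small samples a median-based or percentile-based rule may work better.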
8. Join with Reference Tables
Use joins to validate and correct data against reference tables.
-- Example: Update invalid data using a reference table (UPDATE ... FROM, as in PostgreSQL; MySQL uses UPDATE ... JOIN instead)
UPDATE your_table
SET column_name = ref_table.correct_value
FROM reference_table ref_table
WHERE your_table.column_name = ref_table.invalid_value;
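Where UPDATE ... FROM is not available, a correlated subquery gives the same correction. The sketch below uses Python's built-in sqlite3 module; the `addresses` and `state_fixes` tables and their contents are hypothetical.

```python
import sqlite3

# In-memory database; both tables and all rows are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (id INTEGER PRIMARY KEY, state TEXT)")
conn.execute(
    "CREATE TABLE state_fixes (invalid_value TEXT PRIMARY KEY, correct_value TEXT)"
)
conn.executemany(
    "INSERT INTO addresses (state) VALUES (?)",
    [("Calif.",), ("CA",), ("N.Y.",), ("TX",)],
)
conn.executemany(
    "INSERT INTO state_fixes (invalid_value, correct_value) VALUES (?, ?)",
    [("Calif.", "CA"), ("N.Y.", "NY")],
)

# Look up the correction per row. The WHERE clause limits the update to rows
# that have a match, so unmatched rows are not overwritten with NULL.
conn.execute(
    """
    UPDATE addresses
    SET state = (
        SELECT correct_value
        FROM state_fixes
        WHERE state_fixes.invalid_value = addresses.state
    )
    WHERE state IN (SELECT invalid_value FROM state_fixes)
    """
)

states = [row[0] for row in conn.execute("SELECT state FROM addresses ORDER BY id")]
print(states)
```

Without the WHERE guard, every row lacking an entry in the reference table would have its value set to NULL.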
By combining these techniques, you can ensure your data is clean, consistent, and ready for analysis or further processing.