Skip to main content

Command Palette

Search for a command to run...

Data Preprocessing in Machine Learning

Basics of Data Preprocessing

Published
6 min read
Data Preprocessing in Machine Learning

A complete checklist for EDA, Cleaning, and Feature Engineering.

While neural architecture design often garners the most attention in Machine Learning, the efficacy of any model is fundamentally constrained by the quality of its input data. Real-world datasets are rarely pristine; they frequently contain inconsistencies, missing values, and formatting errors.

If compromised data is fed into a model, the output will inevitably be flawed a principle known as "Garbage In, Garbage Out."

This post outlines a systematic "preprocessing pipeline" I have established for ensuring data integrity and readiness, utilizing Python and the Pandas library.

1. Exploratory Data Analysis (EDA)

Goal: To audit the dataset's structure, quality, and underlying statistical properties before algorithmic application.

EDA is the diagnostic phase. It is essential to verify the integrity of the dataset before proceeding.

MetricObjectivePython Method
Shape & StructureAssess dataset dimensionality (observations vs. features).df.shape, df.head()
Data TypesIdentify type mismatches (e.g., numeric values stored as strings).df.info(), df.dtypes
Statistical SummaryAnalyze central tendencies (mean, median) and dispersion.df.describe()
CardinalityDetermine the number of unique values in categorical features.df['col'].nunique()

Technical Note: Executing df.info() is the primary step. A common issue is numeric columns being cast as "Objects" due to the presence of non-numeric characters (e.g., currency symbols or corrupted entries).

2. Data Cleaning

Goal: To resolve anomalies and errors that would otherwise cause model instability or convergence failure.

A. Handling Missing Values (Imputation)

There are three primary strategies for addressing Null/NaN values:

  1. Deletion: Removing rows with excessive missing data. (df.dropna())

  2. Statistical Imputation:

    • Numerical: Replacing missing values with the Mean or Median.

    • Categorical: Replacing missing values with the Mode (frequency).

  3. Model-Based Imputation: Utilizing algorithms (e.g., KNN Imputer) to infer missing values based on correlations.

from sklearn.impute import SimpleImputer

# Imputing missing 'Age' values with the Median to remain robust against outliers
imputer = SimpleImputer(strategy='median')
df['Age'] = imputer.fit_transform(df[['Age']])

B. Handling Duplicates

Duplicate observations can artificially inflate the weight of specific data points, leading to overfitting.

  • Resolution: df.drop_duplicates()

C. Handling Noisy Data

Noise refers to irrelevant or incorrect data that is difficult for machines to interpret. We use three techniques to smooth it out:

  • Binning: Sorting data into segments (bins) and smoothing them by replacing values with the bin mean or boundary values.

  • Regression: Fitting data to a linear or multiple regression function to smooth out fluctuations.

  • Clustering: Grouping similar data points; outliers that fall outside these clusters are identified as noise and removed.

D. Outlier Detection and Removal

Extreme deviations can significantly skew statistical parameters (like Mean and Variance), particularly in sensitive models like Linear Regression.

  • Z-Score Method: Filters data points exceeding $\pm3$ Standard Deviations from the mean.

  • IQR Method: Filters data falling outside the Interquartile Range.

$$Q1 - 1.5 \times IQR$$

$$Q3 + 1.5 \times IQR$$

2. Data Integration

Goal: To merge data from various sources into a single, unified dataset.

In real-world systems, data rarely lives in one CSV file. It sits in SQL databases, JSON logs, and third-party APIs.

  • Record Linkage: The process of identifying records from different datasets that refer to the same entity (e.g., matching "J. Smith" in Sales with "John Smith" in HR).

  • Data Fusion: Combining data from multiple sources to create a richer dataset, resolving inconsistencies (e.g., one source says "Male," another says "M").

3. Feature Engineering and Transformation

Goal: To translate raw data into a numerical format compatible with mathematical optimization.

A. Categorical Encoding

Machine Learning models require numerical input. Textual labels must be mapped to vector representations.

  • Label Encoding: Assigns an integer index to each category (e.g., Low=0, Medium=1, High=2).

    • Application: Ordinal data where rank/order is significant.
  • One-Hot Encoding: Creates binary columns for each category (e.g., Is_Red, Is_Blue).

  • Application: Nominal data where no intrinsic order exists to prevent the model from inferring false hierarchies.

# One-Hot Encoding implementation
df = pd.get_dummies(df, columns=['Category'])

B. Feature Scaling (Normalization)

Goal: To unify the scale of disparate features.

Without scaling, features with larger magnitudes (e.g., "Salary") will disproportionately influence the model's gradients compared to smaller features (e.g., "Age").

MethodUse Case
StandardizationOptimizers assuming Gaussian distribution (SVMs, Neural Networks).
Min-Max ScalingBounded ranges (0-1), often used in Image Processing.

$$x_{\text{scaled}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

$$z_{\text{score}} = \frac{x - \mu}{\sigma}$$

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)

4. Data Reduction

Goal: To reduce the dataset's volume while maintaining its analytical integrity.

A. Dimensionality Reduction

Techniques like Principal Component Analysis (PCA) reduce the number of variables (features) by combining correlated features into principal components, retaining essential information with fewer columns.

B. Numerosity Reduction

Reducing the number of data points (rows) using methods like Sampling. This allows us to train models faster on a representative subset of the data.

C. Data Compression

Encoding data in a compact form to save storage and processing power.

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# 1. DATA CLEANING
# Handling Missing Values (Imputation)
df['Age'].fillna(df['Age'].mean(), inplace=True)

# Handling Duplicates
df.drop_duplicates(inplace=True)

# Handling Noisy Data (Binning)
# Grouping Age into 3 bins: Low, Medium, High
df['Age_Binned'] = pd.qcut(df['Age'], q=3, labels=['Low', 'Medium', 'High'])

# 2. DATA TRANSFORMATION
# Standardization (Z-Score)
scaler = StandardScaler()
df['Income_Scaled'] = scaler.fit_transform(df[['Income']])

# Normalization (Min-Max)
min_max = MinMaxScaler()
df['Score_Normalized'] = min_max.fit_transform(df[['Score']])

# Discretization (Mapping categorical text to numbers)
# Concept Hierarchy: Low=0, Medium=1, High=2
mapping = {'Low': 0, 'Medium': 1, 'High': 2}
df['Age_Ordinal'] = df['Age_Binned'].map(mapping)

print("Preprocessing Complete.")

5. Data Visualization

Goal: To leverage graphical representations for identifying latent patterns and correlations.

Visual inspection often reveals insights that tabular summaries miss.

  • Univariate Analysis:

    • Histograms/Distplots: To verify if the feature follows a Normal Distribution. (sns.histplot)

    • Boxplots: To visually identify outliers and spread. (sns.boxplot)

  • Bivariate Analysis:

    • Correlation Heatmap: Critical for Feature Selection. This matrix visualizes the Pearson correlation coefficient between variables, identifying collinearity or strong predictors.
import seaborn as sns
import matplotlib.pyplot as plt

# Generating a Correlation Matrix Heatmap
plt.figure(figsize=(10,8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.show()

Summary

Preprocessing is not an optional step; it is a requisite for high-performance models. A robust pipeline follows this sequence:

  1. Audit: Inspect structure and types.

  2. Clean: Impute missing values and remove duplicates.

  3. Encode: Convert categorical variables to numeric vectors.

  4. Scale: Normalize feature magnitudes.

  5. Visualize: Validate distributions and correlations.

With the data now sanitized and structured, we are prepared to proceed to model training.