Data Preprocessing in Machine Learning
Basics of Data Preprocessing

A complete checklist for EDA, Cleaning, and Feature Engineering.
While neural architecture design often garners the most attention in Machine Learning, the efficacy of any model is fundamentally constrained by the quality of its input data. Real-world datasets are rarely pristine; they frequently contain inconsistencies, missing values, and formatting errors.
If compromised data is fed into a model, the output will inevitably be flawed a principle known as "Garbage In, Garbage Out."
This post outlines a systematic "preprocessing pipeline" I have established for ensuring data integrity and readiness, utilizing Python and the Pandas library.
1. Exploratory Data Analysis (EDA)
Goal: To audit the dataset's structure, quality, and underlying statistical properties before algorithmic application.
EDA is the diagnostic phase. It is essential to verify the integrity of the dataset before proceeding.
| Metric | Objective | Python Method |
| Shape & Structure | Assess dataset dimensionality (observations vs. features). | df.shape, df.head() |
| Data Types | Identify type mismatches (e.g., numeric values stored as strings). | df.info(), df.dtypes |
| Statistical Summary | Analyze central tendencies (mean, median) and dispersion. | df.describe() |
| Cardinality | Determine the number of unique values in categorical features. | df['col'].nunique() |
Technical Note: Executing
df.info()is the primary step. A common issue is numeric columns being cast as "Objects" due to the presence of non-numeric characters (e.g., currency symbols or corrupted entries).
2. Data Cleaning
Goal: To resolve anomalies and errors that would otherwise cause model instability or convergence failure.

A. Handling Missing Values (Imputation)
There are three primary strategies for addressing Null/NaN values:
Deletion: Removing rows with excessive missing data. (
df.dropna())Statistical Imputation:
Numerical: Replacing missing values with the Mean or Median.
Categorical: Replacing missing values with the Mode (frequency).
Model-Based Imputation: Utilizing algorithms (e.g., KNN Imputer) to infer missing values based on correlations.
from sklearn.impute import SimpleImputer
# Imputing missing 'Age' values with the Median to remain robust against outliers
imputer = SimpleImputer(strategy='median')
df['Age'] = imputer.fit_transform(df[['Age']])
B. Handling Duplicates
Duplicate observations can artificially inflate the weight of specific data points, leading to overfitting.
- Resolution:
df.drop_duplicates()
C. Handling Noisy Data
Noise refers to irrelevant or incorrect data that is difficult for machines to interpret. We use three techniques to smooth it out:
Binning: Sorting data into segments (bins) and smoothing them by replacing values with the bin mean or boundary values.
Regression: Fitting data to a linear or multiple regression function to smooth out fluctuations.
Clustering: Grouping similar data points; outliers that fall outside these clusters are identified as noise and removed.
D. Outlier Detection and Removal
Extreme deviations can significantly skew statistical parameters (like Mean and Variance), particularly in sensitive models like Linear Regression.
Z-Score Method: Filters data points exceeding $\pm3$ Standard Deviations from the mean.
IQR Method: Filters data falling outside the Interquartile Range.
$$Q1 - 1.5 \times IQR$$
$$Q3 + 1.5 \times IQR$$
2. Data Integration
Goal: To merge data from various sources into a single, unified dataset.
In real-world systems, data rarely lives in one CSV file. It sits in SQL databases, JSON logs, and third-party APIs.
Record Linkage: The process of identifying records from different datasets that refer to the same entity (e.g., matching "J. Smith" in Sales with "John Smith" in HR).
Data Fusion: Combining data from multiple sources to create a richer dataset, resolving inconsistencies (e.g., one source says "Male," another says "M").
3. Feature Engineering and Transformation
Goal: To translate raw data into a numerical format compatible with mathematical optimization.
A. Categorical Encoding
Machine Learning models require numerical input. Textual labels must be mapped to vector representations.
Label Encoding: Assigns an integer index to each category (e.g., Low=0, Medium=1, High=2).
- Application: Ordinal data where rank/order is significant.
One-Hot Encoding: Creates binary columns for each category (e.g.,
Is_Red,Is_Blue).Application: Nominal data where no intrinsic order exists to prevent the model from inferring false hierarchies.
# One-Hot Encoding implementation
df = pd.get_dummies(df, columns=['Category'])
B. Feature Scaling (Normalization)
Goal: To unify the scale of disparate features.
Without scaling, features with larger magnitudes (e.g., "Salary") will disproportionately influence the model's gradients compared to smaller features (e.g., "Age").
| Method | Use Case |
| Standardization | Optimizers assuming Gaussian distribution (SVMs, Neural Networks). |
| Min-Max Scaling | Bounded ranges (0-1), often used in Image Processing. |
$$x_{\text{scaled}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$
$$z_{\text{score}} = \frac{x - \mu}{\sigma}$$
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
4. Data Reduction
Goal: To reduce the dataset's volume while maintaining its analytical integrity.
A. Dimensionality Reduction
Techniques like Principal Component Analysis (PCA) reduce the number of variables (features) by combining correlated features into principal components, retaining essential information with fewer columns.
B. Numerosity Reduction
Reducing the number of data points (rows) using methods like Sampling. This allows us to train models faster on a representative subset of the data.
C. Data Compression
Encoding data in a compact form to save storage and processing power.
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# 1. DATA CLEANING
# Handling Missing Values (Imputation)
df['Age'].fillna(df['Age'].mean(), inplace=True)
# Handling Duplicates
df.drop_duplicates(inplace=True)
# Handling Noisy Data (Binning)
# Grouping Age into 3 bins: Low, Medium, High
df['Age_Binned'] = pd.qcut(df['Age'], q=3, labels=['Low', 'Medium', 'High'])
# 2. DATA TRANSFORMATION
# Standardization (Z-Score)
scaler = StandardScaler()
df['Income_Scaled'] = scaler.fit_transform(df[['Income']])
# Normalization (Min-Max)
min_max = MinMaxScaler()
df['Score_Normalized'] = min_max.fit_transform(df[['Score']])
# Discretization (Mapping categorical text to numbers)
# Concept Hierarchy: Low=0, Medium=1, High=2
mapping = {'Low': 0, 'Medium': 1, 'High': 2}
df['Age_Ordinal'] = df['Age_Binned'].map(mapping)
print("Preprocessing Complete.")
5. Data Visualization
Goal: To leverage graphical representations for identifying latent patterns and correlations.
Visual inspection often reveals insights that tabular summaries miss.
Univariate Analysis:
Histograms/Distplots: To verify if the feature follows a Normal Distribution. (
sns.histplot)Boxplots: To visually identify outliers and spread. (
sns.boxplot)
Bivariate Analysis:
- Correlation Heatmap: Critical for Feature Selection. This matrix visualizes the Pearson correlation coefficient between variables, identifying collinearity or strong predictors.
import seaborn as sns
import matplotlib.pyplot as plt
# Generating a Correlation Matrix Heatmap
plt.figure(figsize=(10,8))
sns.heatmap(df.corr(), annot=True, cmap='coolwarm')
plt.show()
Summary
Preprocessing is not an optional step; it is a requisite for high-performance models. A robust pipeline follows this sequence:
Audit: Inspect structure and types.
Clean: Impute missing values and remove duplicates.
Encode: Convert categorical variables to numeric vectors.
Scale: Normalize feature magnitudes.
Visualize: Validate distributions and correlations.
With the data now sanitized and structured, we are prepared to proceed to model training.




