Visualizing Data — Matplotlib and Seaborn

Why Stats Aren't Enough

I thought .describe() was enough. It gave me the mean, the max, and the median. But then I learned about Anscombe's Quartet.

This is a famous dataset where four different groups of data have the exact same statistical properties (same mean, same variance), but when you plot them, they look completely different. One is a curve, one is a line, and one is just noise.

I used to think visualization was just about making charts look pretty for presentations. I was wrong. In Machine Learning, visualization is a debugging tool.

If I feed raw numbers into a Neural Network without looking at them, I am flying blind.

Is my data skewed? (My model will fail).
Are there outliers? (My model will get confused).
Is there a pattern? (If I can't see it, the model might not either).

Today,lets do Matplotlib (for control) and Seaborn (for statistics).

1. Matplotlib

Matplotlib is the grandfather of Python visualization. It is powerful, customizable, and honestly... a bit verbose. It gives you control over every single pixel.

The Hierarchy (Figure vs. Axes)

This was the hardest concept to grasp. I kept copying code that used plt.plot() and other code that used ax.plot().

Here is the mental model:

The Figure (fig): The blank canvas or the window.
The Axes (ax): The actual plot inside the canvas (contains the x-axis, y-axis, lines, etc.).

The Professional Way (Object-Oriented): Instead of using the global plt state (which gets messy with complex plots), we explicitly create figure and axes objects.

import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
y = np.sin(x)

# Create a figure and a single subplot
fig, ax = plt.subplots()

# Plot on the axes object
ax.plot(x, y, color='blue', linestyle='--', label='Sine Wave')
ax.set_title("My First Signal")
ax.set_xlabel("Time")
ax.legend()

plt.show()

The Concept:

To plot a curve, we don't draw a line. We generate hundreds of tiny dots (x, y) coordinates and connect them.

Key Customizations:

Overlapping: To plot two waves on the same graph, simply call .plot() twice on the same axis. Matplotlib automatically overlays them.
Markers: Use marker='+' or marker='.' to see the actual data points.
Colors & Styles: Use color='green' (or 'g') and linestyle='--' (dashed).

The Sine Wave

Goal: Plot a standard wave function with distinct markers.

Style: Red color, dashed line (--), plus markers (+).

import matplotlib.pyplot as plt
import numpy as np

# 1. Generate Data
x = np.linspace(-10, 10, 100)
y_sin = np.sin(x)

# 2. Plot Sine
plt.figure(figsize=(8, 4))
plt.plot(x, y_sin, label='Sine Wave', color='red', marker='+', linestyle='--')

# 3. Decorate
plt.title("Sine Wave Function")
plt.xlabel("Input (x)")
plt.ylabel("Output (sin(x))")
plt.axhline(0, color='black', linewidth=1)
plt.legend()
plt.grid(True)
plt.show()

The Cosine Wave

Goal: Plot a cosine wave (which is just a shifted sine wave).

Style: Blue color, dotted line (:), dot markers (.).

# 1. Generate Data
x = np.linspace(-10, 10, 100)
y_cos = np.cos(x)

# 2. Plot Cosine
plt.figure(figsize=(8, 4))
plt.plot(x, y_cos, label='Cosine Wave', color='blue', marker='.', linestyle=':')

# 3. Decorate
plt.title("Cosine Wave Function")
plt.xlabel("Input (x)")
plt.ylabel("Output (cos(x))")
plt.axhline(0, color='black', linewidth=1)
plt.legend()
plt.grid(True)
plt.show()

The Parabola (y = x²)

Goal: Plot a non-linear exponential function. This shape is crucial because "Loss Functions" in Neural Networks look exactly like this (we try to find the bottom of the curve).

Style: Green color, solid line.

# 1. Generate Data
x = np.linspace(-10, 10, 100)
y_parabola = x**2  # No scaling needed anymore!

# 2. Plot Parabola
plt.figure(figsize=(8, 4))
plt.plot(x, y_parabola, label='Parabola (y=x^2)', color='green')

# 3. Decorate
plt.title("Parabola (Quadratic Function)")
plt.xlabel("Input (x)")
plt.ylabel("Output (x^2)")
plt.axhline(0, color='black', linewidth=1)
plt.axvline(0, color='black', linewidth=1) # Vertical line at 0 helps visual center
plt.legend()
plt.grid(True)
plt.show()

Pro Tip: Using Subplots

If you ever do want to see them all at once—but neatly separated—you use plt.subplots. This creates a grid of distinct graphs instead of overlapping them.

The Concept: We use Matplotlib to build the "frame" (the grid of empty boxes) and then use Seaborn to "paint" inside each specific box.

The Dashboard (Side-by-Side)

This creates a layout with 1 Row and 2 Columns.

import matplotlib.pyplot as plt
import seaborn as sns

# 1. Load Data
tips = sns.load_dataset('tips')
sns.set_theme(style="whitegrid")

# 2. Create the Grid (The "Figure")
# nrows=1, ncols=2 means side-by-side
# figsize=(15, 5) makes it wide
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# 3. Plot 1 (Left Side)
# ax=axes[0] puts this plot in the first box
sns.histplot(data=tips, x="total_bill", kde=True, ax=axes[0], color="teal")
axes[0].set_title("Distribution of Bill Amounts")

# 4. Plot 2 (Right Side)
# ax=axes[1] puts this plot in the second box
sns.boxplot(data=tips, x="day", y="total_bill", ax=axes[1], palette="Set2")
axes[1].set_title("Outliers by Day")

# 5. Clean up layout
plt.tight_layout() # Essential command! Prevents overlapping labels.
plt.show()

How to Scale It (2x2 Grid)

If you want 4 plots (2 Rows, 2 Columns), the indexing changes slightly to a 2D array: axes[row, column].

# Create 2 Rows, 2 Columns
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Top Left: axes[0, 0]
sns.histplot(data=tips, x="total_bill", ax=axes[0, 0])

# Top Right: axes[0, 1]
sns.boxplot(data=tips, x="day", y="total_bill", ax=axes[0, 1])

# Bottom Left: axes[1, 0]
sns.scatterplot(data=tips, x="total_bill", y="tip", ax=axes[1, 0])

# Bottom Right: axes[1, 1]
sns.countplot(data=tips, x="day", ax=axes[1, 1])

plt.tight_layout()
plt.show()

2. Seaborn

If Matplotlib is "driving manual," Seaborn is a Tesla on Autopilot.

While Matplotlib is for math, Seaborn is for Data Analysis. It takes a Pandas DataFrame and summarizes complex statistics instantly.

1. Relational Plots: How variables interact

A. Scatter Plot (sns.scatterplot)

The Theory: This is the most fundamental plot in science. It maps one variable to the X-axis and another to the Y-axis to reveal Correlation.
When to use: When you want to verify relationships. (e.g., "As study hours increase, do grades increase?").
Advanced Feature: The hue parameter adds a 3rd dimension (color) and size adds a 4th dimension (dot size).

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

# Load a built-in dataset to practice
df = sns.load_dataset('tips')

# Matplotlib would take 5 lines to do this. Seaborn takes 1.
# The 'hue' parameter adds a 3rd dimension (color) automatically
sns.scatterplot(data=df, x='total_bill', y='tip', hue='Sex')

plt.title("Tips vs Total Bill")
plt.show()

B. Line Plot (sns.lineplot)

The Theory: Similar to scatter, but connects the dots. It assumes the X-axis represents a sequence (like Time).
When to use: For Time-Series data (Stock prices, temperature over a week, model accuracy over training epochs).

2. Distribution Plots: The Shape of Data

A. Histogram (sns.histplot)

The Theory: It chops the data into "bins" (buckets) and counts how many data points fall into each bucket.

Neural Networks assume data is "Normally Distributed" (Bell Curve). A histogram instantly tells you if your data is Skewed (leaning left or right). If it's skewed, you might need to fix it with math (Log Transform) before training.

import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
tips = sns.load_dataset('tips')

plt.figure(figsize=(7, 5))

# kde=True adds the smooth curve (Kernel Density Estimate)
# bins=20 controls how many "buckets" we chop the data into
sns.histplot(data=tips, x="total_bill", kde=True, bins=20, color="skyblue")

plt.title("Histogram: Distribution of Total Bill")
plt.show()

B. Box Plot (sns.boxplot)

The Theory: This is pure statistics.
- The Box: Represents the Interquartile Range (IQR)—the middle 50% of your data.
- The Line: The Median (50th percentile).
- The Whiskers: The range of "normal" data (usually 1.5x the IQR).
- The Diamonds: Outliers.
When to use: To detect anomalies. If you see points outside the whiskers, you must investigate them.

plt.figure(figsize=(7, 5))

# x=category, y=numerical
# palette="Set2" changes the color scheme
sns.boxplot(data=tips, x="day", y="total_bill", palette="Set2")

plt.title("Box Plot: Spotting Outliers by Day")
plt.show()

C. Violin Plot (sns.violinplot)

The Theory: A Box Plot is a square. A Violin Plot is the same thing, but it uses a curved line (KDE) to show the density of data at different values.
When to use: When a Box Plot hides important details. For example, if your data has two peaks (bimodal), a box plot looks normal, but a violin plot will clearly show the "two bumps."

plt.figure(figsize=(7, 5))

# inner="quartile" draws lines inside the violin showing the median and IQR
sns.violinplot(data=tips, x="day", y="total_bill", palette="pastel", inner="quartile")

plt.title("Violin Plot: Density & Shape")
plt.show()

Pro Tip: Plotting them Side-by-Side

To really understand the difference, it helps to see them next to each other.

# Create a dashboard with 1 row and 3 columns
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

# Plot 1: Histogram
sns.histplot(data=tips, x="total_bill", kde=True, ax=axes[0], color="skyblue")
axes[0].set_title("Histogram (Distribution)")

# Plot 2: Box Plot
sns.boxplot(data=tips, x="day", y="total_bill", ax=axes[1], palette="Set2")
axes[1].set_title("Box Plot (Outliers)")

# Plot 3: Violin Plot
sns.violinplot(data=tips, x="day", y="total_bill", ax=axes[2], palette="pastel")
axes[2].set_title("Violin Plot (Density)")

plt.tight_layout()
plt.show()

3. Categorical Plots: Comparing Groups

A. Bar Plot (sns.barplot)

The Theory: It calculates a summary statistic (usually the Mean) for different categories.
Important: The little black line on top is the Error Bar (Confidence Interval). It tells you how uncertain the calculated mean is.
When to use: "What is the average Salary for Engineers vs. Doctors?"

import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
tips = sns.load_dataset('tips')

plt.figure(figsize=(7, 5))

# By default, barplot calculates the MEAN of the y-variable
# hue="smoker" splits the bars further by adding a sub-category
sns.barplot(data=tips, x="sex", y="total_bill", hue="smoker", palette="muted")

plt.title("Bar Plot: Average Bill by Gender & Smoker Status")
plt.ylabel("Average Total Bill ($)")
plt.show()

B. Count Plot (sns.countplot)

The Theory: It strictly counts how many times a category appears. It does not calculate an average.
When to use: To check for Class Imbalance. (e.g., "Do I have 1000 pictures of cats but only 5 pictures of dogs?").

plt.figure(figsize=(7, 5))

# Notice we ONLY provide 'x'. The 'y' is calculated automatically (the count).
sns.countplot(data=tips, x="day", hue="sex", palette="viridis")

plt.title("Count Plot: Customer Traffic by Day")
plt.ylabel("Number of Customers")
plt.show()

The Key Difference

If you are ever confused, remember this:

Bar Plot needs an x (Category) AND a y (Number to average).
Count Plot needs ONLY an x (Category to count).

4. Matrix & Advanced Plots

A. Heatmap (sns.heatmap)

The Theory: It turns a table of numbers into colors. We use it primarily for the Correlation Matrix.
The ML Context: If two input variables have a correlation of 1.0 (perfectly correlated), you usually drop one of them. This is called "Multicollinearity." The heatmap makes these spots glow red.

import seaborn as sns
import matplotlib.pyplot as plt

# Load dataset
tips = sns.load_dataset('tips')

plt.figure(figsize=(8, 6))

# 1. Calculate Correlation Matrix
# numeric_only=True is required in new Pandas versions to ignore text columns
corr_matrix = tips.corr(numeric_only=True)

# 2. Plot Heatmap
# annot=True writes the actual numbers in the boxes
# cmap='coolwarm' makes positive red and negative blue
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)

plt.title("Correlation Heatmap")
plt.show()

B. Pair Plot (sns.pairplot)

The Theory: It runs a "Bird's Eye View" scan. It creates a grid where every variable is plotted against every other variable.
When to use: This is the first command I run on a new dataset to spot immediate patterns.

# Load the 'iris' dataset (classic for ML classification)
iris = sns.load_dataset('iris')

# This one line generates 16 subplots automatically!
# hue="species" colors every single plot by the flower type
sns.pairplot(iris, hue="species", palette="husl")

plt.show()

C. Joint Plot (sns.jointplot)

The Theory: A hybrid plot. It puts a Scatter plot in the center and Histograms on the top and right axes.
When to use: When you want detailed analysis of just two variables (Correlation + Distribution) at the same time.

# kind="reg" adds a Regression Line (Trend line) automatically
# kind="hex" creates a cool honeycomb density plot
sns.jointplot(data=tips, x="total_bill", y="tip", kind="reg", color="purple")

plt.show()

Heatmap - "Which features are redundant?" (Multicollinearity check)
Pair Plot- "I just got this data. Show me everything." (Initial scan)
Joint Plot - "Are 'Bill' and 'Tip' actually related, and is the data normal?" (Deep dive)

Visualizing Data — Matplotlib and Seaborn

Why Stats Aren't Enough

1. Matplotlib

The Hierarchy (Figure vs. Axes)

Pro Tip: Using Subplots

The Dashboard (Side-by-Side)

How to Scale It (2x2 Grid)

2. Seaborn

1. Relational Plots: How variables interact

2. Distribution Plots: The Shape of Data

Pro Tip: Plotting them Side-by-Side

3. Categorical Plots: Comparing Groups

4. Matrix & Advanced Plots

Comments

More from this blog

Day 14: setup_env.sh to human-learn — Building Leak-Proof Pipelines on M1

Data Wrangling — Cleaning, Merging, and Grouping

Pandas: A Deep Dive

The NumPy Survival Guide

Command Palette

Why Stats Aren't Enough

1. Matplotlib

The Hierarchy (Figure vs. Axes)

Pro Tip: Using Subplots

The Dashboard (Side-by-Side)

How to Scale It (2x2 Grid)

2. Seaborn

1. Relational Plots: How variables interact

2. Distribution Plots: The Shape of Data

Pro Tip: Plotting them Side-by-Side

3. Categorical Plots: Comparing Groups

4. Matrix & Advanced Plots

Comments

More from this blog