Data Preprocessing 7 - Box - Whisker Plot

A Box and Whisker plot, commonly known as a box plot, is a standardized way of displaying the distribution of data based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. It's an excellent tool for visualizing the spread and skewness of the data, as well as identifying outliers.

Box plot Structure

Here's how a box plot is typically structured:

The Box: Represents the interquartile range (IQR), which is the range between the first quartile (Q1) and the third quartile (Q3). The box shows the middle 50% of the data.

The Median Line: A line inside the box shows the median (second quartile) of the dataset.

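The Whiskers: Extend from each edge of the box to the most extreme data points that still lie within 1.5 times the IQR of that quartile, that is, no lower than Q1 - 1.5 * IQR and no higher than Q3 + 1.5 * IQR. They show the range of the data that is not flagged as an outlier.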

Outliers: Data points outside the whiskers are often plotted individually as dots and considered outliers.

Outliers

A point is treated as an outlier if either of the following holds, where IQR = Q3 - Q1:

point > Q3 + 1.5 * IQR
point < Q1 - 1.5 * IQR
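
The rule above can be sketched with NumPy. This is a minimal sketch rather than the post's own implementation: the helper name `iqr_outliers` is hypothetical, the sample values are illustrative, and the quartiles come from `np.percentile` with its default linear interpolation, so other quartile conventions can give slightly different fences.

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the k * IQR rule.

    Hypothetical helper for illustration; k=1.5 is the usual box plot choice.
    """
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])  # default linear interpolation
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    # Points strictly outside the fences are flagged as outliers
    mask = (values < lower_fence) | (values > upper_fence)
    return lower_fence, upper_fence, values[mask]


# Illustrative sample values
data = [4, 8, 15, 16, 23, 42, 108]
lower, upper, outliers = iqr_outliers(data)
print(f"fences: [{lower}, {upper}]")
print(f"outliers: {outliers.tolist()}")
```

For these values Q1 = 11.5 and Q3 = 32.5, so the IQR is 21, the fences are -20 and 64, and only 108 is flagged.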

Python Code

Identifying Outliers

import seaborn as sns
import matplotlib.pyplot as plt

# Example data
data = [4, 8, 15, 16, 23, 42, 108]

# Creating the box plot
sns.boxplot(x=data)
plt.title('Box Plot')
plt.xlabel('Values')
plt.show()
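
To read off the numbers behind such a plot rather than estimating them by eye, one option is `matplotlib.cbook.boxplot_stats`, which computes the statistics Matplotlib uses when it draws a box plot. The following is a short sketch on the same example list; the variable name `stats` is illustrative, and seaborn's rendering may differ in minor details.

```python
from matplotlib import cbook

data = [4, 8, 15, 16, 23, 42, 108]

# boxplot_stats returns one dict of statistics per dataset
stats = cbook.boxplot_stats(data, whis=1.5)[0]

print("Q1:", stats["q1"])
print("Median:", stats["med"])
print("Q3:", stats["q3"])
print("Whisker ends:", stats["whislo"], stats["whishi"])
print("Outliers:", list(stats["fliers"]))
```

Here the upper whisker stops at 42 rather than at the maximum, because 108 lies beyond Q3 + 1.5 * IQR and is reported as an outlier ("flier" in Matplotlib's terminology).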

Interpreting Spread and Skewness:

import numpy as np
import matplotlib.pyplot as plt

# Fix the seed so the plots are reproducible
np.random.seed(42)

# Symmetrically distributed data (normal distribution)
symmetric_data = np.random.normal(loc=0, scale=1, size=1000)

# Right-skewed data (exponential distribution)
skewed_data = np.random.exponential(scale=1, size=1000)

plt.figure(figsize=(12, 6))

# Box plot for symmetrically distributed data
plt.subplot(1, 2, 1)
plt.boxplot(symmetric_data)
plt.title("Symmetrically Distributed Data")
plt.ylabel("Values")

# Box plot for asymmetrically distributed data
plt.subplot(1, 2, 2)
plt.boxplot(skewed_data)
plt.title("Asymmetrically Distributed Data")
plt.ylabel("Values")

plt.show()

Symmetrically Distributed Data:

Spread: The spread of the data can be seen from the range of the box (IQR) and whiskers. A larger IQR or longer whiskers indicate greater variability in the data.

Skewness: In a symmetric distribution, the median line sits near the center of the box, the two whiskers are roughly equal in length, and any outliers appear in similar numbers beyond both whiskers.

Asymmetrically Distributed Data:

Spread: Similar to the symmetric case, the spread is indicated by the IQR and whisker lengths.

Skewness: Skewness is evident when the median line is off-center within the box and one whisker is significantly longer than the other. In a right-skewed distribution, the whisker on the high-value side is longer (the upper whisker in the vertical plots above, or the right whisker in a horizontal plot), the median lies closer to the first quartile (Q1), and outliers tend to appear on the high-value side. The opposite is true for a left-skewed distribution.

These box plots provide a visual summary of the central tendency, variability, and skewness of the data, allowing for an intuitive understanding of the dataset's characteristics.
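
The visual cue of an off-center median can also be expressed as a number. One common choice, not used in the post itself, is the quartile (Bowley) skewness coefficient, (Q3 + Q1 - 2 * median) / (Q3 - Q1). It is near 0 when the median sits in the middle of the box, positive when the median is closer to Q1 (right skew), and negative when it is closer to Q3 (left skew). The function name `quartile_skewness` and the fixed seed are assumptions made for this sketch.

```python
import numpy as np


def quartile_skewness(values):
    """Bowley skewness: (Q3 + Q1 - 2 * median) / (Q3 - Q1)."""
    q1, median, q3 = np.percentile(values, [25, 50, 75])
    return (q3 + q1 - 2 * median) / (q3 - q1)


rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
symmetric_data = rng.normal(loc=0, scale=1, size=1000)
skewed_data = rng.exponential(scale=1, size=1000)

symmetric_skew = quartile_skewness(symmetric_data)
right_skew = quartile_skewness(skewed_data)
print(f"symmetric sample: {symmetric_skew:.2f}")
print(f"exponential sample: {right_skew:.2f}")
```

Because it only uses the quartiles, this coefficient describes the box itself and is not affected by how extreme the outliers are.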

Python Code

Data-Science/box whisker.ipynb at main · lovelynrose/Data-Science (github.com)
