Data Preprocessing 7 - Box - Whisker Plot
A Box and Whisker plot, commonly known as a box plot, is a standardized way of displaying the distribution of data based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. It's an excellent tool for visualizing the spread and skewness of the data, as well as identifying outliers.
Box Plot Structure
Here's how a box plot is typically structured:
The Box: Represents the interquartile range (IQR), the range between the first quartile (Q1) and the third quartile (Q3). The box contains the middle 50% of the data.
The Median Line: A line inside the box marks the median (second quartile) of the dataset.
The Whiskers: Extend from each end of the box to the lowest and highest data values that lie within 1.5 times the IQR of Q1 and Q3, respectively. They show the range of the data excluding outliers.
Outliers: Data points beyond the whiskers are usually plotted individually as dots and treated as potential outliers.
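The quantities behind each part of the plot can be computed by hand. The following is a minimal sketch using only Python's standard library, not the code from the notebook linked at the end of this post. The function name `box_plot_stats` and the `sample` data are made up for illustration, and the quartiles use the "inclusive" interpolation method, which is one of several common conventions; other tools may give slightly different quartile values.

```python
from statistics import quantiles


def box_plot_stats(data):
    """Return the values a box plot draws for a list of numbers.

    Hypothetical helper for illustration only.
    """
    values = sorted(data)
    # n=4 splits the data into quartiles; "inclusive" interpolates
    # linearly between neighbouring data points.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Fences: the furthest a whisker is allowed to reach.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Each whisker stops at the most extreme data point inside its fence.
    lower_whisker = min(v for v in values if v >= lower_fence)
    upper_whisker = max(v for v in values if v <= upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "lower_whisker": lower_whisker,
        "upper_whisker": upper_whisker,
    }


# Illustrative data: 30 lies above the upper fence (11 + 1.5 * 6 = 20),
# so the upper whisker stops at 12 instead.
sample = [2, 4, 5, 7, 8, 9, 11, 12, 30]
stats = box_plot_stats(sample)
print(stats)
```

Note that the whiskers end at actual data points (here 2 and 12), not at the fences themselves.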
Outliers
Python Code
Identify Outlier
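As a rough sketch of the 1.5 × IQR rule described in the structure section, the snippet below flags values that fall outside the whisker limits. It uses only the standard library and is not taken from the linked notebook; `find_outliers`, the parameter `k`, and the `readings` data are hypothetical names chosen for this example, and the "inclusive" quartile method is an assumption.

```python
from statistics import quantiles


def find_outliers(data, k=1.5):
    """Return the values lying more than k * IQR beyond Q1 or Q3.

    Hypothetical helper; k=1.5 is the conventional box plot multiplier.
    """
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [x for x in data if x < lower_fence or x > upper_fence]


# Illustrative data: Q1 = 15, Q3 = 18, so the fences are 10.5 and 22.5.
readings = [12, 14, 15, 15, 16, 17, 18, 19, 45]
outliers = find_outliers(readings)
print(outliers)
```

A flagged value is only a candidate outlier; whether to drop, cap, or keep it depends on the data and the task.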
Interpreting Spread and Skewness:
Symmetrically Distributed Data:
Spread: The spread of the data can be seen from the range of the box (IQR) and whiskers. A larger IQR or longer whiskers indicate greater variability in the data.
Skewness: In a symmetric distribution, the median line in the box is centered, and the whiskers are roughly equal in length on both sides. The distribution of data points within the box and outside the box (whiskers) is roughly symmetric.
Asymmetrically Distributed Data:
Spread: As in the symmetric case, the spread is indicated by the IQR and the whisker lengths.
Skewness: Skewness is evident if the median line within the box is off-center and one whisker is significantly longer than the other. In a right-skewed distribution, the upper (right) whisker is longer and the median sits closer to the first quartile (Q1). The opposite is true for a left-skewed distribution.
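The position of the median inside the box can also be expressed as a number. One common measure is the quartile (Bowley) skewness, (Q3 + Q1 - 2 × median) / (Q3 - Q1), which is 0 when the median is centered, positive when it sits closer to Q1, and negative when it sits closer to Q3. The sketch below is illustrative only: `quartile_skewness` and the three sample lists are hypothetical, the "inclusive" quartile method is an assumption, and the measure ignores whisker lengths.

```python
from statistics import quantiles


def quartile_skewness(data):
    """Bowley's quartile skewness: where the median sits within the box.

    Hypothetical helper; returns a value between -1 and 1.
    """
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        # The box has zero width, so there is no offset to measure.
        return 0.0
    return (q3 + q1 - 2 * median) / iqr


# Illustrative data sets.
symmetric = [1, 2, 3, 4, 5, 6, 7, 8, 9]
right_skewed = [1, 2, 2, 3, 3, 4, 6, 9, 15]
left_skewed = [1, 7, 10, 12, 13, 13, 14, 14, 15]

for name, data in [
    ("symmetric", symmetric),
    ("right_skewed", right_skewed),
    ("left_skewed", left_skewed),
]:
    print(name, quartile_skewness(data))
```

A positive result corresponds to the right-skewed picture described above (median near Q1), and a negative result to the left-skewed one.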
Box plots provide a visual summary of the central tendency, variability, and skewness of the data, allowing for an intuitive understanding of the dataset's characteristics.
Python Code
Data-Science/box whisker.ipynb at main · lovelynrose/Data-Science (github.com)