Data Preprocessing 3 - Outlier Detection
Identifying and removing outliers is a crucial step in data preprocessing, especially in statistical analyses and machine learning models, as outliers can significantly skew results.
Techniques to Identify Outliers
1. Statistical Methods:
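- Z-Score:
A Z-score measures how many standard deviations a data point lies from the mean. Typically, a point is flagged as an outlier when Z > 3 or Z < -3, that is:
Point > Mean + 3 * SD
Point < Mean - 3 * SD
This threshold works well in practice when the data is roughly normally distributed, since about 99.7% of normally distributed values fall within three standard deviations of the mean.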
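As a minimal sketch of the Z-score rule, the snippet below uses only Python's standard `statistics` module. The function name `zscore_outliers`, the `threshold` parameter, and the sample `readings` are made up for illustration, and the sample standard deviation is an assumed choice (the population version would also be reasonable).

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumed choice)
    if sd == 0:
        return []  # all values are identical, so nothing can be flagged
    return [v for v in values if abs((v - mean) / sd) > threshold]

# Made-up sample: 19 readings between 10 and 13, plus one extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
            12, 11, 10, 13, 12, 11, 12, 10, 13, 95]

print(zscore_outliers(readings))  # [95]
```

Note that the mean and SD are computed from data that still contains the outlier, so on very small samples an extreme point can inflate the SD enough to hide itself.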
- IQR (Interquartile Range) Method:
This method involves calculating the IQR, which is the difference between the 75th and 25th percentiles (Q3 and Q1). Outliers are often considered to be any data points that lie more than 1.5 IQRs below Q1 or above Q3.
Point > Q3 + 1.5 * IQR
Point < Q1 - 1.5 * IQR
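The IQR fences can be sketched in the same way. This example assumes Python 3.8+ for `statistics.quantiles`. The function name `iqr_outliers`, the multiplier parameter `k`, and the sample `readings` are illustrative. Different quartile conventions can give slightly different fences.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values that fall outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(..., n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Made-up sample: 19 readings between 10 and 13, plus one extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
            12, 11, 10, 13, 12, 11, 12, 10, 13, 95]

print(iqr_outliers(readings))  # [95]
```

Because quartiles are barely affected by a single extreme value, this rule tends to be more robust than the Z-score on small or skewed samples.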
2. Visualization Techniques:
- Box Plots:
These plots provide a visual summary of the central tendency, dispersion, and skewness of the data and are great for spotting outliers.
- Scatter Plots:
Useful for identifying outliers in bivariate data.
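A box plot can be drawn with a few lines of code. The sketch below assumes matplotlib is installed. The helper name `boxplot_fliers`, its optional `path` argument, and the sample `readings` are hypothetical. By default, matplotlib's whiskers extend to 1.5 * IQR, and points beyond them are drawn individually as "fliers".

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt

def boxplot_fliers(values, path=None):
    """Draw a box plot and return the points plotted beyond the whiskers."""
    fig, ax = plt.subplots()
    result = ax.boxplot(values)  # default whiskers: 1.5 * IQR
    ax.set_ylabel("value")
    if path is not None:
        fig.savefig(path)  # optionally save the figure to a file
    # result["fliers"] holds one Line2D per box; its y-data are the outliers.
    fliers = [float(y) for y in result["fliers"][0].get_ydata()]
    plt.close(fig)
    return fliers

# Made-up sample: 19 readings between 10 and 13, plus one extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
            12, 11, 10, 13, 12, 11, 12, 10, 13, 95]

print(boxplot_fliers(readings))  # [95.0]
```

Passing a file name as `path` (for example `"boxplot.png"`) saves the figure for visual inspection.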
3. Domain-Specific Methods:
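Depending on the context, certain values might be considered outliers based on domain-specific knowledge rather than on the distribution of the data. For example, a recorded human age of 200 is invalid no matter what the rest of the dataset looks like.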
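A domain rule is often just a validity range supplied by someone who knows the data. The sketch below is purely illustrative: the helper `out_of_range`, the `patients` records, and the 0 to 120 age limits are all assumptions, not values from any particular dataset.

```python
def out_of_range(records, field, low, high):
    """Return the records whose `field` value lies outside [low, high]."""
    return [r for r in records if not (low <= r[field] <= high)]

# Hypothetical records; the age limits below are an assumed domain rule.
patients = [
    {"id": 1, "age": 34},
    {"id": 2, "age": -3},
    {"id": 3, "age": 61},
    {"id": 4, "age": 212},
]

flagged = out_of_range(patients, "age", 0, 120)
print([r["id"] for r in flagged])  # [2, 4]
```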