Data Preprocessing 3 - Outlier Detection

January 19, 2024

Identifying and removing outliers is a crucial step in data preprocessing, especially in statistical analyses and machine learning models, as outliers can significantly skew results.

Techniques to Identify Outliers

1. Statistical Methods:

- Z-Score:

A Z-score measures how many standard deviations a data point lies from the mean. Typically, a point with Z > 3 or Z < -3 is flagged as an outlier, that is:

Point > Mean + 3 * SD

Point < Mean - 3 * SD

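In practice this works well when the data are roughly normally distributed, since about 99.7% of normally distributed values fall within 3 standard deviations of the mean. Keep in mind that the mean and SD are themselves pulled by extreme values, so very large outliers can partly mask each other.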
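The Z-score rule above can be sketched in a few lines using only Python's standard library. The function name zscore_outliers, the sample readings, and the choice of the population standard deviation (statistics.pstdev) are assumptions made for illustration; the linked notebook may do this differently.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population SD, not sample SD
    if sd == 0:
        return []  # all values are identical, so nothing can be flagged
    return [v for v in values if abs((v - mean) / sd) > threshold]

# Hypothetical sample: 19 ordinary readings and one extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
            12, 10, 11, 13, 12, 11, 10, 12, 13, 95]
print(zscore_outliers(readings))  # [95]
```

With very small samples this rule may never fire: using the population SD, no point in a sample of 10 or fewer values can have |Z| greater than 3, so the IQR method is often a better fit there.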

- IQR (Interquartile Range) Method:

This method involves calculating the IQR, which is the difference between the 75th and 25th percentiles (Q3 and Q1). Outliers are usually taken to be any data points that lie more than 1.5 IQRs below Q1 or more than 1.5 IQRs above Q3.

Point > Q3 + 1.5 * IQR

Point < Q1 - 1.5 * IQR
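A minimal sketch of the IQR rule, again with the standard library only. The function name iqr_outliers, the multiplier parameter k, and the sample data are illustrative assumptions. Note that statistics.quantiles uses one particular quartile convention (the "exclusive" method by default); other tools may compute Q1 and Q3 slightly differently, which can shift the fences a little.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _median, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    lower = q1 - k * iqr  # lower fence
    upper = q3 + k * iqr  # upper fence
    return [v for v in values if v < lower or v > upper]

# Hypothetical sample: 19 ordinary readings and one extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
            12, 10, 11, 13, 12, 11, 10, 12, 13, 95]
print(iqr_outliers(readings))  # [95]
```

Because the quartiles barely move when a single extreme value is added, this rule is less affected by the outliers it is trying to find than the Z-score rule.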

2. Visualization Techniques:

- Box Plots:

These plots give a visual summary of the central tendency, dispersion, and skewness of the data. Points drawn beyond the whiskers, which conventionally extend 1.5 IQRs from the quartiles, stand out as potential outliers.

- Scatter Plots:

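Useful for identifying outliers in bivariate data: a point that sits far away from the main cloud of points, or far from the overall trend, is a candidate outlier.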

3. Domain-Specific Methods:

Depending on the context, certain values might be considered outliers due to domain-specific knowledge.
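As a simple illustration of a domain rule, the sketch below flags values outside a range that domain knowledge says is plausible. The function name out_of_range, the age data, and the 0 to 120 limits are all hypothetical; the right limits depend entirely on the problem at hand.

```python
def out_of_range(values, low, high):
    """Return the values that fall outside the closed interval [low, high]."""
    return [v for v in values if not (low <= v <= high)]

# Hypothetical example: human ages in years, assumed valid between 0 and 120.
ages = [34, 27, -2, 45, 61, 199, 38]
print(out_of_range(ages, low=0, high=120))  # [-2, 199]
```

Values like these are often data-entry errors rather than genuine extremes, so a statistical rule might miss them or flag them for the wrong reason.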

Python Code

Data-Science/outliers.ipynb at main · lovelynrose/Data-Science (github.com)
