Data Preprocessing 3 - Outlier Detection

Identifying and removing outliers is a crucial step in data preprocessing, especially in statistical analyses and machine learning models, as outliers can significantly skew results.

Techniques to Identify Outliers

1. Statistical Methods:

   - Z-Score:

A Z-score measures how many standard deviations a data point lies from the mean: Z = (Point - Mean) / SD. Typically, points with Z > 3 or Z < -3 are treated as outliers, that is:

Point > Mean + 3 * SD

Point < Mean - 3 * SD

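In practice this rule works well when the data is roughly normally distributed, since about 99.7% of values then fall within 3 standard deviations of the mean. Keep in mind that the mean and SD are themselves pulled by extreme values, so the rule can miss outliers in small or heavily skewed samples.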
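The Z-score rule above can be sketched with Python's standard library alone. This is a minimal illustration, not a definitive implementation: the function name `zscore_outliers`, the `threshold` parameter, and the sample `readings` are all assumptions made for the example, and the population standard deviation is used for simplicity.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds `threshold`."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD; statistics.stdev() gives the sample SD
    if sd == 0:
        return []  # all values are identical, so nothing stands out
    return [x for x in values if abs((x - mean) / sd) > threshold]

# Hypothetical sample: 20 readings close to 50 plus one extreme value.
readings = [48, 49, 50, 51, 52] * 4 + [120]
print(zscore_outliers(readings))  # [120]
```

One caveat worth knowing: with ten points or fewer, no value can reach |Z| > 3, so on very small samples this rule never flags anything.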

   - IQR (Interquartile Range) Method:

This method involves calculating the IQR, which is the difference between the 75th and 25th percentiles (Q3 and Q1). Outliers are usually taken to be any data points that lie more than 1.5 IQRs below Q1 or above Q3.

Point > Q3 + 1.5 * IQR

Point < Q1 - 1.5 * IQR
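The two IQR fences above can be sketched as follows. The function name `iqr_outliers`, the multiplier parameter `k`, and the sample `scores` are assumptions for illustration. Quartile conventions differ slightly between libraries, so the exact fences may vary a little depending on how Q1 and Q3 are computed; here `statistics.quantiles` with `method="inclusive"` is used.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < lower or x > upper]

# Hypothetical sample: nine ordinary scores and one extreme value.
scores = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
print(iqr_outliers(scores))  # [102]
```

For this sample Q1 = 12 and Q3 = 13.75, so the fences sit at 9.375 and 16.375. Unlike the Z-score rule, the quartiles barely move when the extreme value is added, which is why this method tends to hold up better on small or skewed samples.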

2. Visualization Techniques:

   - Box Plots:

These plots provide a visual summary of the central tendency, dispersion, and skewness of the data. They are well suited to spotting outliers, which are drawn as individual points beyond the whiskers.

   - Scatter Plots:

These are useful for identifying outliers in bivariate data, where an outlier shows up as a point far from the main cluster.

3. Domain-Specific Methods: 

Depending on the context, certain values might be considered outliers due to domain-specific knowledge.
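A domain rule is often just a plausibility range per field. The sketch below assumes a hypothetical patient record; the field names and limits in `VALID_RANGES` are invented for illustration and would need to come from someone who knows the domain.

```python
# Hypothetical plausibility limits; real bounds must come from domain knowledge.
VALID_RANGES = {
    "age_years": (0, 120),
    "body_temp_c": (30.0, 45.0),
}

def out_of_range(record, valid_ranges=VALID_RANGES):
    """Return the names of fields whose values fall outside their allowed range."""
    flagged = []
    for field, (low, high) in valid_ranges.items():
        value = record.get(field)
        # Missing values are skipped here; handling them is an imputation question.
        if value is not None and not (low <= value <= high):
            flagged.append(field)
    return flagged

patient = {"age_years": 212, "body_temp_c": 36.8}
print(out_of_range(patient))  # ['age_years']
```

A value such as an age of 212 may not look extreme to a purely statistical rule on a small dataset, but it is impossible in context, which is exactly the case domain checks are meant to catch.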

Python Code
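As a sketch of the visualization techniques described earlier, the snippet below draws a box plot and a scatter plot with matplotlib (assumed to be installed). The `scores` and `hours` data are hypothetical, and the non-interactive `Agg` backend is selected only so the script also runs on a machine without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line to use plt.show()
import matplotlib.pyplot as plt

# Hypothetical data: test scores and the hours each student studied.
scores = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
hours = [1, 2, 2, 3, 2, 1, 4, 3, 5, 2]

fig, (ax_box, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))

# Box plot: by default the whiskers extend to 1.5 * IQR beyond the quartiles,
# and anything past them is drawn as an individual "flier" point.
box = ax_box.boxplot(scores)
ax_box.set_title("Box plot of scores")
flier_values = [float(v) for v in box["fliers"][0].get_ydata()]

# Scatter plot: an outlier appears as a point far from the main cluster.
ax_scatter.scatter(hours, scores)
ax_scatter.set_xlabel("hours")
ax_scatter.set_ylabel("score")
ax_scatter.set_title("Scores vs. hours")

fig.tight_layout()
# fig.savefig("outliers.png")  # uncomment to write the figure to disk
print(flier_values)
```

The printed list holds the points matplotlib drew beyond the whiskers. For this sample it should contain only the value 102, the same point that the 1.5 * IQR rule flags.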
