Data Preprocessing 6 - Z-Score

The Z-score is a statistical measurement that describes a value's relationship to the mean of a group of values, measured in terms of standard deviations from the mean.

Benefits of Z-score:

Standardization: It puts features measured on different scales onto a common scale, making them easier to compare.

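Handling Outliers: Makes outliers easier to spot, because values far from the mean receive z-scores of large magnitude. Standardization does not remove outliers, and the mean and standard deviation are themselves affected by extreme values.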

Requirement for Many Machine Learning Algorithms: Especially useful in algorithms that are sensitive to the scale of data, like k-NN, SVM, and neural networks.

Standardization

Z-score normalization, also known as standardization, is a technique used to scale data so that it has a mean of 0 and a standard deviation of 1. This process involves transforming each data point in your dataset by subtracting the mean and then dividing by the standard deviation.
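The transformation described above can be written as a single formula. The symbols are notation introduced here for illustration: x is an original data point, \mu is the mean of the dataset, \sigma is its standard deviation, and z is the resulting z-score.

```latex
z = \frac{x - \mu}{\sigma}
```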

Standardized data ends up with a mean of 0 and a standard deviation of 1 for the following reasons:

Mean Becomes 0:

By subtracting the mean from each data point, we shift the entire distribution so that the mean is aligned at 0. This process is called centering the data.

Mathematically, the mean of the transformed dataset is exactly 0. If you compute it numerically, the result may differ from 0 by a tiny amount because of floating-point rounding.

Standard Deviation Becomes 1:

By dividing each data point (after it has been centered) by the standard deviation, we scale the data so that its distribution has a standard deviation of 1.

This process is called scaling the data. It ensures that the variance of the dataset is 1, and since the standard deviation is the square root of the variance, the standard deviation also becomes 1.
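The two properties above can be checked numerically. The following is a minimal sketch using only Python's standard library; the function name standardize and the sample values are made up for illustration, and it assumes the population standard deviation (statistics.pstdev) is the one being used.

```python
import statistics


def standardize(values):
    """Return the z-score of every value, using the population mean and SD."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    if std == 0:
        raise ValueError("standard deviation is zero; z-scores are undefined")
    # Center each value (subtract the mean), then scale it (divide by the SD).
    return [(v - mean) / std for v in values]


# Hypothetical sample data, for illustration only.
heights_cm = [150.0, 160.0, 165.0, 170.0, 180.0, 195.0]
z_scores = standardize(heights_cm)

print(statistics.fmean(z_scores))   # very close to 0
print(statistics.pstdev(z_scores))  # very close to 1
```

If the sample standard deviation (statistics.stdev) were used for scaling instead, the same check would have to use statistics.stdev as well for the result to come out at 1.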

Calculating the Z-Score of a Data Point
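As a rough illustration, the z-score of a single point is its distance from the mean, measured in standard deviations. The helper name z_score and the numbers below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import statistics


def z_score(x, mean, std):
    """Return how many standard deviations x lies from the mean."""
    if std <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mean) / std


# A value of 85 in a group with mean 70 and standard deviation 10.
print(z_score(85, 70, 10))  # 1.5

# The same calculation with the mean and SD taken from a small dataset.
data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical values
mean = statistics.fmean(data)     # 5.0
std = statistics.pstdev(data)     # 2.0 (population standard deviation)
print(z_score(9, mean, std))      # 2.0
```

A positive z-score means the point lies above the mean and a negative one means it lies below.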


Outlier Detection

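Z-scores are widely used in practice for anomaly detection. A common rule of thumb flags a point as an outlier when either of the following holds, which is the same as its z-score being above 3 or below -3: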

point > mean + 3 * SD

point < mean - 3 * SD

Python Code
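A minimal sketch of the 3-standard-deviation rule, using only Python's standard library. The function name find_outliers, the threshold parameter, and the sample readings are assumptions made for illustration. The sketch uses the population standard deviation.

```python
import statistics


def find_outliers(values, threshold=3.0):
    """Return the values lying more than `threshold` SDs from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation (assumption)
    if std == 0:
        return []  # all values are identical, so nothing can be an outlier
    lower = mean - threshold * std
    upper = mean + threshold * std
    # Keep only the points outside the [lower, upper] band.
    return [v for v in values if v < lower or v > upper]


# Hypothetical sensor readings with one obviously extreme value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
            12, 10, 11, 12, 13, 11, 12, 10, 11, 95]

print(find_outliers(readings))  # [95]
```

One caveat: the mean and standard deviation are computed from the same data that contains the outlier, so very small datasets can hide it. With the population standard deviation, no value in a dataset of ten or fewer points can have a z-score above 3.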
