Data Preprocessing 6 - Z-Score
The Z-score is a statistical measurement that describes a
value's relationship to the mean of a group of values, measured in terms of
standard deviations from the mean.
Benefits of Z-score:
Standardization: It brings different scales to a common
standard, making it easier to compare different features.
Handling Outliers: Because each value is expressed in standard deviations from the mean, extreme values are easy to flag, and the scaling is less distorted by them than min-max normalization.
Requirement for Many Machine Learning Algorithms: Especially
useful in algorithms that are sensitive to the scale of data, like k-NN, SVM,
and neural networks.
Standardization
Z-score normalization, also known as standardization, is a technique that scales data so it has a mean of 0 and a standard deviation of 1. Each data point is transformed by subtracting the mean and then dividing by the standard deviation. Here's why the transformed data ends up with mean 0 and standard deviation 1:
Mean Becomes 0:
By subtracting the mean from each data point, we shift the
entire distribution so that the mean is aligned at 0. This process is called
centering the data.
Mathematically, if you calculate the mean of the transformed
dataset, it should be extremely close to 0, within a small margin of error due
to floating-point arithmetic.
Standard Deviation Becomes 1:
By dividing each data point (after it has been centered) by
the standard deviation, we scale the data so that its distribution has a
standard deviation of 1.
This process is called scaling the data. It ensures that the
variance of the dataset is 1, and since the standard deviation is the square
root of the variance, the standard deviation also becomes 1.
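The two steps above (centering, then scaling) can be sketched in plain Python using only the standard library. The function name z_score_normalize is mine, chosen for illustration, not a standard API:

```python
import statistics

def z_score_normalize(values):
    """Center to mean 0, then scale to standard deviation 1."""
    mean = statistics.fmean(values)   # step 1: compute the mean
    sd = statistics.pstdev(values)    # population standard deviation
    return [(x - mean) / sd for x in values]

data = [10, 20, 30, 40, 50]
z = z_score_normalize(data)

# Mean of the result is 0 and standard deviation is 1,
# up to floating-point rounding error.
print(round(statistics.fmean(z), 10))   # 0.0
print(round(statistics.pstdev(z), 10))  # 1.0
```

Note the use of pstdev (population standard deviation); stdev (the sample version, dividing by n - 1) would give a slightly different scaling on small datasets.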
Calculating the Z-Score of a Data Point
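For a single value x, the z-score is z = (x - mean) / SD, where the mean and standard deviation come from the group the value belongs to. A minimal sketch (the helper name z_score is mine, for illustration):

```python
import statistics

def z_score(x, values):
    """Z-score of x relative to the mean and standard deviation of values."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return (x - mean) / sd

scores = [70, 75, 80, 85, 90]
# 90 is about 1.4 standard deviations above the mean of 80
print(z_score(90, scores))
```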
Outlier Detection
A common rule of thumb in practice: after computing z-scores, a point is flagged as an outlier if it lies more than 3 standard deviations from the mean, i.e.
point > mean + 3 * SD, or
point < mean - 3 * SD
(equivalently, |z| > 3).
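The 3-standard-deviation rule can be sketched like this (flag_outliers is a hypothetical helper, not a library function; note that on very small samples an extreme value inflates the standard deviation, so the 3-SD threshold may fail to flag it):

```python
import statistics

def flag_outliers(values, k=3):
    """Return values lying more than k standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [x for x in values if x > mean + k * sd or x < mean - k * sd]

# Twenty typical readings around 10-13, plus one extreme value.
data = [10, 12, 11, 13] * 5 + [100]
print(flag_outliers(data))  # [100]
```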