Data Preprocessing 8 - Data Normalization with Solved Example
Normalization techniques are used in data preprocessing to adjust the scale of data without distorting differences in the ranges of values. Here are several commonly used normalization techniques:
1. Min-Max Normalization:
- Rescales the feature to a fixed range, typically [0, 1].
- Formula: (x – min)/(max – min), where min and max are the minimum and maximum values in the feature, respectively.
- Suitable for cases where you need to bound values but don't have outliers.
2. Z-score Normalization (Standardization):
- Transforms the feature to have a mean of 0 and a standard deviation of 1.
- Formula: (x – μ)/σ, where μ is the mean and σ is the standard deviation.
- Particularly useful when the data follows a Gaussian distribution.
- A solved problem is given in the blog post on z-score.
3. Decimal Scaling:
- Moves the decimal point of the feature's values.
- The data is divided by 10 raised to the power of n, where n is the smallest integer such that the maximum absolute value of the normalized data is less than 1.
4. Mean Normalization:
- Similar to Z-score normalization, but instead of dividing by the standard deviation it divides by the range, rescaling values to [-1, 1].
- Formula: (x – mean)/(max – min)
5. Unit Vector Normalization (L2 Normalization):
- Scales the whole feature vector to unit norm, i.e. so that the sum of squared values equals 1.
- Often used in text classification and clustering.
6. Robust Scaling:
- Uses the median and the Interquartile Range (IQR) instead of the mean and standard deviation.
- Effective for datasets with outliers, as it uses statistics that are robust to outliers.
Each of these techniques can be chosen based on the specific needs of your dataset and the model you're using. For example, Min-Max Normalization and Z-score Normalization are very common in machine learning tasks, whereas Decimal Scaling might be more specific to certain types of numerical computations where scaling the magnitude of values is necessary.
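As a quick illustration of the two formulas mentioned most often above, here is a minimal NumPy sketch of Min-Max and Z-score normalization (the function names are only illustrative and are not taken from the notebook linked at the end of this post):

```python
import numpy as np

def min_max_normalize(x):
    """Rescale a feature to [0, 1] using (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

def z_score_normalize(x):
    """Standardize a feature to mean 0 and standard deviation 1 using (x - mean) / std."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()
```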
Example
Dataset: `[4, 8, 15, 16, 23, 42, 108]`
1. Min-Max Normalization:
- Min Value: 4
- Max Value: 108
- Normalized Data: `[0.0, 0.038, 0.106, 0.115, 0.183, 0.365, 1.0]`
2. Decimal Scaling:
- Scaling Factor (n): 3 (since 108 is the max value, and 10^3 is the smallest power of 10 greater than 108)
- Scaled Data: `[0.004, 0.008, 0.015, 0.016, 0.023, 0.042, 0.108]`
3. Mean Normalization:
- Mean: 30.857
- Range (Max - Min): 104
- Normalized Data: `[-0.258, -0.220, -0.152, -0.143, -0.076, 0.107, 0.742]`
4. Unit Vector Normalization (L2 Normalization):
- L2 Norm: 120.491
- Normalized Data: `[0.033, 0.066, 0.124, 0.133, 0.191, 0.349, 0.896]`
5. Robust Scaling:
- Median: 16
- IQR (Q3 - Q1): 21
- Scaled Data: `[-0.571, -0.381, -0.048, 0.0, 0.333, 1.238, 4.381]`
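The IQR of 21 comes from Q1 = 11.5 and Q3 = 32.5, which are the quartiles NumPy's default linear interpolation produces for this seven-value dataset. A quick check:

```python
import numpy as np

data = np.array([4, 8, 15, 16, 23, 42, 108], dtype=float)

q1, q3 = np.percentile(data, [25, 75])   # 11.5 and 32.5 with the default linear interpolation
iqr = q3 - q1                            # 21.0
median = np.median(data)                 # 16.0

robust_scaled = (data - median) / iqr
print(np.round(robust_scaled, 3))        # [-0.571 -0.381 -0.048  0.     0.333  1.238  4.381]
```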
Each method provides a different way of scaling and transforming the data:
- Min-Max normalization rescales the data within the range [0, 1].
- Decimal scaling reduces the magnitude of data.
- Mean normalization centers the data around zero.
- L2 normalization scales the data to unit length.
- Robust scaling reduces the effects of outliers and scales data based on the median and IQR.
Python Code with NumPy
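The author's full code is in the notebook linked below; the following is only a rough sketch of what a NumPy-only version might look like, reproducing the figures from the solved example above:

```python
import numpy as np

data = np.array([4, 8, 15, 16, 23, 42, 108], dtype=float)

# Min-Max Normalization: (x - min) / (max - min)
min_max = (data - data.min()) / (data.max() - data.min())

# Decimal Scaling: divide by 10^n, with n the smallest integer so that max(|x|) / 10^n < 1
n = 0
while np.abs(data).max() / 10 ** n >= 1:
    n += 1
decimal_scaled = data / 10 ** n

# Mean Normalization: (x - mean) / (max - min)
mean_norm = (data - data.mean()) / (data.max() - data.min())

# Unit Vector (L2) Normalization: x / ||x||_2
l2_scaled = data / np.linalg.norm(data)

# Robust Scaling: (x - median) / IQR
q1, q3 = np.percentile(data, [25, 75])
robust_scaled = (data - np.median(data)) / (q3 - q1)

for name, values in [("Min-Max", min_max), ("Decimal Scaling", decimal_scaled),
                     ("Mean Normalization", mean_norm), ("L2 Normalization", l2_scaled),
                     ("Robust Scaling", robust_scaled)]:
    print(f"{name}: {np.round(values, 3)}")
```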
Python Code with Built-in Functions
Note: Decimal scaling and custom transformations should be done with NumPy.
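Again only as a sketch (the author's version is in the notebook linked below): scikit-learn ships scalers for Min-Max, Z-score, L2 and robust scaling, while decimal scaling and mean normalization are written by hand with NumPy, as the note says:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer, RobustScaler

data = np.array([4, 8, 15, 16, 23, 42, 108], dtype=float)
column = data.reshape(-1, 1)                  # scalers expect a 2-D (samples x features) array

min_max = MinMaxScaler().fit_transform(column)       # rescales to [0, 1]
z_score = StandardScaler().fit_transform(column)     # mean 0, standard deviation 1
robust = RobustScaler().fit_transform(column)        # (x - median) / IQR

# Normalizer works row-wise, so treat the whole feature as one row for L2 scaling
l2_scaled = Normalizer(norm="l2").fit_transform(data.reshape(1, -1))

# No built-in scaler covers decimal scaling or mean normalization; use NumPy directly
n = 0
while np.abs(data).max() / 10 ** n >= 1:
    n += 1
decimal_scaled = data / 10 ** n
mean_norm = (data - data.mean()) / (data.max() - data.min())
```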
Python Code
Data-Science/data normalization.ipynb at main · lovelynrose/Data-Science (github.com)