Data Preprocessing 8 - Data Normalization with Solved Example


Normalization techniques are used in data preprocessing to adjust the scale of data without distorting differences in the ranges of values. Here are several commonly used normalization techniques:

1. Min-Max Normalization:

   - Rescales the feature to a fixed range, typically [0, 1].

   - Formula: (x – min)/(max – min), where min and max are the minimum and maximum values of the feature, respectively.

   - Suitable for cases where you need to bound values but don't have outliers.

2. Z-score Normalization (Standardization):

   - Transforms the feature to have a mean of 0 and a standard deviation of 1.

   - Formula: (x – μ)/σ, where μ is the mean and σ is the standard deviation of the feature.

   - Particularly useful when the data follows a Gaussian distribution.

A fully solved problem is in the blog post on z-score; for completeness, a z-score computation on this post's example dataset is also included at the end of the solved example below.

3. Decimal Scaling:

   - Moves the decimal point of the feature's values.

   - Each value is divided by 10^n, where n is the smallest integer such that the maximum absolute value of the scaled data is less than 1; equivalently, n = floor(log10(max|x|)) + 1.

4. Mean Normalization:

   - Similar to Z-score normalization, but divides by the range (max – min) instead of the standard deviation; the result is centered on 0 and lies within (-1, 1).

   - Formula: (x – mean)/(max – min)

5. Unit Vector Normalization (L2 Normalization):

   - Scales the whole feature vector to unit norm, i.e., so that the sum of its squared values is 1.

   - Often used in text classification and clustering.

6. Robust Scaling:

   - Uses the median and the Interquartile Range (IQR) instead of the mean and standard deviation.

   - Effective for datasets with outliers, as it uses statistics that are robust to outliers.

Each of these techniques can be chosen based on the specific needs of your dataset and the model you're using. For example, Min-Max Normalization and Z-score Normalization are very common in machine learning tasks, whereas Decimal Scaling might be more specific to certain types of numerical computations where scaling the magnitude of values is necessary.
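To make the formulas above concrete before the library-based versions later in this post, here is a minimal pure-Python sketch (no dependencies) of the two most common techniques; the function names are illustrative, not from any library:

def min_max(values):
    # (x - min) / (max - min), mapping the data into [0, 1]
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

def z_score(values):
    # (x - mu) / sigma, using the population standard deviation
    mu = sum(values) / len(values)
    sigma = (sum((x - mu) ** 2 for x in values) / len(values)) ** 0.5
    return [(x - mu) / sigma for x in values]

data = [4, 8, 15, 16, 23, 42, 108]
print(min_max(data))   # [0.0, 0.038..., ..., 1.0]
print(z_score(data))   # [-0.802..., ..., 2.303...]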

Example

Dataset `[4, 8, 15, 16, 23, 42, 108]`

1. Min-Max Normalization:

   - Min Value: 4

   - Max Value: 108

   - Normalized Data: `[0.0, 0.038, 0.106, 0.115, 0.183, 0.365, 1.0]`

2. Decimal Scaling:

   - Scaling Factor (n): 3 (since 108 is the max value, and 10^3 is the smallest power of 10 greater than 108)

   - Scaled Data: `[0.004, 0.008, 0.015, 0.016, 0.023, 0.042, 0.108]`

3. Mean Normalization:

   - Mean: 30.857

   - Range (Max - Min): 104

   - Normalized Data: `[-0.258, -0.220, -0.152, -0.143, -0.076, 0.107, 0.742]`

4. Unit Vector Normalization (L2 Normalization):

   - L2 Norm: 120.491

   - Normalized Data: `[0.033, 0.066, 0.124, 0.133, 0.191, 0.349, 0.896]`

5. Robust Scaling:

   - Median: 16

   - IQR (Q3 - Q1): 32.5 - 11.5 = 21 (quartiles computed with linear interpolation, as NumPy and Scikit-Learn do by default)

   - Scaled Data: `[-0.571, -0.381, -0.048, 0.0, 0.333, 1.238, 4.381]`
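6. Z-score Normalization (computed here for completeness, using the population standard deviation as NumPy's np.std does by default):

   - Mean: 30.857

   - Standard Deviation: 33.494

   - Normalized Data: `[-0.802, -0.682, -0.473, -0.444, -0.235, 0.333, 2.303]`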

Each method provides a different way of scaling and transforming the data:

- Min-Max normalization rescales the data within the range [0, 1].

- Decimal scaling reduces the magnitude of data.

- Mean normalization centers the data around zero.

- L2 normalization scales the data to unit length.

- Robust scaling reduces the effects of outliers and scales data based on the median and IQR.

Python Code with NumPy

import numpy as np
from sklearn.preprocessing import RobustScaler
from numpy.linalg import norm

# Example data
data = np.array([4, 8, 15, 16, 23, 42, 108])

# Min-Max Normalization
min_max_normalized = (data - data.min()) / (data.max() - data.min())
print("Min-Max Normalization:", min_max_normalized)

# Z-score Normalization
mean = np.mean(data)
std_dev = np.std(data)
z_score_normalized_np = (data - mean) / std_dev
print("Z-score Normalization (using NumPy):", z_score_normalized_np)

# Decimal Scaling
# n is the smallest integer with max(|x|) / 10**n < 1, i.e. floor(log10(max|x|)) + 1
# (np.ceil(np.log10(max)) alone fails when the max is an exact power of 10)
n = np.floor(np.log10(np.abs(data).max())) + 1
decimal_scaled = data / (10**n)
print("Decimal Scaling:", decimal_scaled)

# Mean Normalization
mean_normalized = (data - data.mean()) / (data.max() - data.min())
print("Mean Normalization:", mean_normalized)

# Unit Vector Normalization (L2 Normalization)
l2_normalized = data / norm(data)
print("L2 Normalization:", l2_normalized)

# Robust Scaling
scaler = RobustScaler()
robust_scaled = scaler.fit_transform(data.reshape(-1, 1)).flatten()
print("Robust Scaling:", robust_scaled)

Python Code with Scikit-Learn's Built-in Functions

from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer, RobustScaler

# Example data: a flat list, plus the column shape Scikit-Learn scalers expect
raw = [4, 8, 15, 16, 23, 42, 108]
data = [[x] for x in raw]

# Min-Max Normalization
min_max_scaler = MinMaxScaler()
min_max_normalized = min_max_scaler.fit_transform(data)
print("Min-Max Normalization:\n", min_max_normalized.flatten())

# Mean Normalization: (x - mean) / (max - min)
# StandardScaler(with_std=False) centers the data; dividing by the range
# (which centering leaves unchanged) then matches the formula above
mean_scaler = StandardScaler(with_std=False)
centered = mean_scaler.fit_transform(data)
mean_normalized = centered / (centered.max() - centered.min())
print("Mean Normalization:\n", mean_normalized.flatten())

# L2 Normalization
# Normalizer works row-wise, so pass the feature as a single row
# (a column of one-element rows would normalize every value to 1.0)
l2_scaler = Normalizer(norm='l2')
l2_normalized = l2_scaler.fit_transform([raw])
print("L2 Normalization:\n", l2_normalized.flatten())

# Robust Scaling
robust_scaler = RobustScaler()
robust_scaled = robust_scaler.fit_transform(data)
print("Robust Scaling:\n", robust_scaled.flatten())

# Z-score Normalization
z_score_scaler = StandardScaler()
z_score_normalized = z_score_scaler.fit_transform(data)
print("Z-score Normalization:\n", z_score_normalized.flatten())

Note: Scikit-Learn has no built-in decimal scaling, so decimal scaling and other custom transformations should be done with NumPy, as in the previous section (or wrapped for pipelines, as sketched below).
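If you do want decimal scaling to slot into a Scikit-Learn workflow, one possible approach (a sketch assuming a simple stateless transform is enough, not the only way) is to wrap the NumPy logic in a FunctionTransformer; the helper name decimal_scale is illustrative:

import numpy as np
from sklearn.preprocessing import FunctionTransformer

def decimal_scale(X):
    # n is the smallest integer with max(|x|) / 10**n < 1
    n = np.floor(np.log10(np.abs(X).max())) + 1
    return X / (10 ** n)

decimal_scaler = FunctionTransformer(decimal_scale)
scaled = decimal_scaler.fit_transform(np.array([[4], [8], [15], [16], [23], [42], [108]]))
print(scaled.flatten())  # [0.004 0.008 0.015 0.016 0.023 0.042 0.108]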

Python Code

Data-Science/data normalization.ipynb at main · lovelynrose/Data-Science (github.com)


