Posts

Data Preprocessing 9 - Data Transformation

Data transformation is a crucial step in data preprocessing, especially in the context of machine learning and statistical analysis. The goal is to convert raw data into a format that is more appropriate for analysis. Several techniques are used for this purpose, each suited to different scenarios and types of data:

1. Normalization/Min-Max Scaling:
   - Scales the data to fit within a specific range, typically 0 to 1 or -1 to 1.
   - Useful when you need to bound values but don't have outliers.

2. Standardization/Z-score Normalization:
   - Transforms data to have a mean of 0 and a standard deviation of 1.
   - Suitable for cases where the data follows a Gaussian distribution.

3. Log Transformation:
   - Applies the natural logarithm to the data.
   - Effective for dealing with skewed data and making it more normally distributed.

4. Box-Cox Transformation:
   - A generalized form of log transf...
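The first three transformations above can be sketched in a few lines of NumPy. This is a minimal illustration under the definitions given, not a library API; the function names are my own.

```python
import numpy as np

def min_max_scale(x, lo=0.0, hi=1.0):
    """Rescale values linearly into the [lo, hi] range."""
    x = np.asarray(x, dtype=float)
    return lo + (x - x.min()) * (hi - lo) / (x.max() - x.min())

def log_transform(x):
    """Natural-log transform; assumes strictly positive inputs."""
    return np.log(np.asarray(x, dtype=float))

data = [1, 5, 10, 20]
print(min_max_scale(data))        # smallest value maps to 0, largest to 1
print(log_transform(data))        # compresses the right tail of skewed data
```

Note that min-max scaling is sensitive to outliers: a single extreme value compresses everything else into a narrow band, which is why the post recommends it only for data without outliers.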

Data Preprocessing 8 - Data Normalization with Solved Example

Normalization techniques are used in data preprocessing to adjust the scale of data without distorting differences in the ranges of values. Here are several commonly used normalization techniques:

1. Min-Max Normalization:
   - Rescales the feature to a fixed range, typically [0, 1].
   - Formula: (x − min)/(max − min), where min and max are the minimum and maximum values of the feature, respectively.
   - Suitable for cases where you need to bound values but don't have outliers.

2. Z-score Normalization (Standardization):
   - Transforms the feature to have a mean of 0 and a standard deviation of 1.
   - Formula: (x − μ)/σ, where μ is the mean and σ is the standard deviation.
   - Particularly useful when the data follows a Gaussian distribution. A solved problem is in the blog post on the z-score.

3. Decimal Scaling:
   - Moves the decimal point of the feature's values.
   - The d...
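The formulas above translate directly into NumPy. A minimal sketch (function names are illustrative; `decimal_scale` assumes the usual convention of dividing by 10^d, where d is the smallest integer that brings the largest absolute value below 1):

```python
import numpy as np

def z_score(x):
    """Z-score normalization: (x - mean) / standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def decimal_scale(x):
    """Divide by 10^d, with d the smallest integer making max(|x|) < 1."""
    x = np.asarray(x, dtype=float)
    d = int(np.floor(np.log10(np.abs(x).max()))) + 1
    return x / 10 ** d

print(z_score([10, 20, 30, 40, 50]))   # mean 0, standard deviation 1
print(decimal_scale([986, -23]))       # divides by 1000: 0.986, -0.023
```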

Data Preprocessing 7 - Box - Whisker Plot

A Box and Whisker plot, commonly known as a box plot, is a standardized way of displaying the distribution of data based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. It's an excellent tool for visualizing the spread and skewness of the data, as well as identifying outliers.

Box Plot Structure

Here's how a box plot is typically structured:
- The Box: Represents the interquartile range (IQR), the range between the first quartile (Q1) and the third quartile (Q3). The box covers the middle 50% of the data.
- The Median Line: A line inside the box marks the median (second quartile) of the dataset.
- The Whiskers: Extend from the box to the highest and lowest values within 1.5 times the IQR from the quartiles. These lines show the range of the data.
- Outliers: Data points outside the whiskers are often plotted individually as dots and considered outliers. A point is an outlier if point > Q3 + 1.5 * IQR or point < Q1 - 1.5 * IQR. P...
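The five-number summary and whisker bounds described above can be computed directly. A minimal NumPy sketch (note that `np.percentile` interpolates between data points by default, so its quartiles can differ slightly from the halve-and-take-medians method used in the manual IQR example):

```python
import numpy as np

def five_number_summary(data):
    """Five-number summary plus the points a box plot would flag as outliers."""
    data = np.asarray(data, dtype=float)
    q1, med, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr     # whisker limits
    outliers = data[(data < lower) | (data > upper)]
    return {"min": data.min(), "q1": q1, "median": med,
            "q3": q3, "max": data.max(), "outliers": outliers}

print(five_number_summary([4, 8, 15, 16, 23, 42, 108]))
```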

Data Preprocessing 6 - z score

The Z-score is a statistical measurement that describes a value's relationship to the mean of a group of values, measured in terms of standard deviations from the mean.

Benefits of the Z-score:
- Standardization: It brings different scales to a common standard, making it easier to compare different features.
- Handling Outliers: Helps in reducing the impact of outliers.
- Requirement for Many Machine Learning Algorithms: Especially useful in algorithms that are sensitive to the scale of data, like k-NN, SVM, and neural networks.

Standardization

Z-score normalization, also known as standardization, is a technique used to scale data so that it has a mean of 0 and a standard deviation of 1. This process involves transforming each data point in your dataset by subtracting the mean and then dividing by the standard deviation. When we perform Z-score normalization (or standardization) on a dataset, we are essentially transforming the data so that it has a mean of 0 and a standar...
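A quick way to convince yourself of the mean-0/std-1 property described above is to standardize a small dataset and check both statistics afterwards. A minimal sketch (the sample values are arbitrary):

```python
import numpy as np

data = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Subtract the mean, then divide by the standard deviation.
standardized = (data - data.mean()) / data.std()

print(standardized.mean())   # approximately 0
print(standardized.std())    # approximately 1
```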

Data Preprocessing 5 - Mean Subtraction - Data Centering

Mean subtraction, also known as centering, is a data preprocessing technique where the mean of the entire feature set is subtracted from each data point. This process has several advantages, particularly in data analysis, statistics, and machine learning:

1. Removes Bias: Mean subtraction removes the average value from your data, centering it around zero. This can help remove any existing bias towards higher or lower values.
2. Improves Algorithm Performance: Many machine learning algorithms perform better or converge faster when the features are centered. For instance, in gradient descent algorithms, centering can speed up the learning process.
3. Facilitates Feature Comparison: When features are centered, it's easier to compare their scales and variances. This is especially useful in multivariate analyses where you're comparing features with possibly different units and scales.
4. Enhances Numerical Stability: Centering can improve t...
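Centering is a one-liner per feature with NumPy: subtract each column's mean from that column. A minimal sketch with a made-up two-feature matrix on very different scales:

```python
import numpy as np

# Two features with very different scales (values are illustrative).
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

# Subtract each column's mean from that column (broadcasting over rows).
X_centered = X - X.mean(axis=0)

print(X_centered.mean(axis=0))   # each column now has mean 0
print(X_centered)
```

Note that centering changes only the location of the data, not its spread: the variance of each feature is unchanged.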

Data Preprocessing 4 - IQR

Solving a Small Dataset Manually

Let's consider a small dataset: `[4, 8, 15, 16, 23, 42, 108]`

Using the IQR Method:
1. Arrange the data in ascending order: `[4, 8, 15, 16, 23, 42, 108]`
2. Calculate Q1 (25th percentile) and Q3 (75th percentile). Calculating the first quartile (Q1) and the third quartile (Q3) manually involves a few steps, especially if the dataset is not already sorted.

Steps to Calculate Q1 and Q3 Manually:
1. Sort the data (if not already sorted):
   - Dataset: `[4, 8, 15, 16, 23, 42, 108]` (already sorted)
2. Find the median (second quartile, Q2):
   - This dataset has 7 elements.
   - The median is the middle value; for our dataset, it's the fourth number: `16`.
3. Find Q1 (first quartile):
   - Q1 is the median of the first half of the data (excluding the median if the number of data points is odd).
   - First half of the data (excluding 16): `[4, 8, 15]`
   - S...
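The manual procedure above (median of each half, excluding the overall median when the count is odd) can be written out in plain Python. A minimal sketch; the helper names are my own:

```python
def median(xs):
    """Median of a list: middle value, or average of the two middle values."""
    xs = sorted(xs)
    n = len(xs)
    mid = n // 2
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2

def quartiles(xs):
    """Q1 and Q3 as medians of the lower and upper halves, excluding the
    overall median when the number of data points is odd (the method used
    in the worked example above)."""
    xs = sorted(xs)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return median(lower), median(upper)

data = [4, 8, 15, 16, 23, 42, 108]
q1, q3 = quartiles(data)
print(q1, q3)   # 8 42, so IQR = 42 - 8 = 34
```

Be aware that other conventions exist: interpolating methods such as NumPy's default `np.percentile` give different quartiles for small datasets.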

Data Preprocessing 3 - Outlier Detection

Identifying and removing outliers is a crucial step in data preprocessing, especially in statistical analyses and machine learning models, as outliers can significantly skew results.

Techniques to Identify Outliers

1. Statistical Methods:
   - Z-Score: A Z-score measures the number of standard deviations an element is from the mean. Typically, a threshold of Z > 3 or Z < -3 is used to identify outliers: point > mean + 3 * SD or point < mean - 3 * SD. In practice, this threshold works well.
   - IQR (Interquartile Range) Method: This method involves calculating the IQR, the difference between the 75th and 25th percentiles (Q3 and Q1). Outliers are often considered to be any data points that lie more than 1.5 IQRs below Q1 or above Q3: point > Q3 + 1.5 * IQR or point < Q1 - 1.5 * IQR.
2. Visualization Techniques:
   - Box Plots: These plots provide a visual summary of the central tendency, dispersion, and skewness of the data and...
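Both statistical methods above fit in a few lines of NumPy. A minimal sketch, reusing the small dataset from the IQR post; note that on a dataset this small the 3-sigma rule flags nothing (the extreme value inflates the standard deviation itself), while the IQR method still catches 108:

```python
import numpy as np

def outliers_zscore(x, threshold=3.0):
    """Points more than `threshold` standard deviations from the mean."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return x[np.abs(z) > threshold]

def outliers_iqr(x, k=1.5):
    """Points more than k * IQR below Q1 or above Q3."""
    x = np.asarray(x, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return x[(x < q1 - k * iqr) | (x > q3 + k * iqr)]

data = [4, 8, 15, 16, 23, 42, 108]
print(outliers_zscore(data))   # empty: max |z| is only about 2.3
print(outliers_iqr(data))      # flags 108
```

This robustness to the outliers' own influence is a common reason to prefer the IQR method on small or heavily skewed datasets.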