Data Preprocessing 4 - IQR

Solving a Small Dataset Manually

Let's consider a small dataset: `[4, 8, 15, 16, 23, 42, 108]`

Using the IQR Method:

1. Arrange data in ascending order: `[4, 8, 15, 16, 23, 42, 108]`

2. Calculate Q1 (25th percentile) and Q3 (75th percentile).

Calculating the first quartile (Q1) and the third quartile (Q3) by hand takes a few steps, starting with sorting the data if it is not already sorted.

 Steps to Calculate Q1 and Q3 Manually:

1. Sort the Data (if not already sorted):

   - Dataset: `[4, 8, 15, 16, 23, 42, 108]` (already sorted)

2. Find the Median (Second Quartile or Q2):

   - This dataset has 7 elements.

   - The median is the middle value, so for our dataset, it's the fourth number: `16`.

3. Find Q1 (First Quartile):

   - Q1 is the median of the first half of the data (excluding the median if the number of data points is odd).

   - First half of the data (excluding 16): `[4, 8, 15]`

   - Since there are 3 numbers, the median (and hence Q1) is the middle number: `8`.

4. Find Q3 (Third Quartile):

   - Q3 is the median of the second half of the data (excluding the median if the number of data points is odd).

   - Second half of the data (excluding 16): `[23, 42, 108]`

   - Since there are 3 numbers, the median (and hence Q3) is the middle number: `42`.

Therefore, for the dataset `[4, 8, 15, 16, 23, 42, 108]`, the first quartile (Q1) is `8`, and the third quartile (Q3) is `42`.

These calculations use the median-of-halves method, with the median excluded from both halves when the number of data points is odd (the Moore and McCabe convention). Other conventions exist, such as Tukey's hinges, which include the median in both halves, and results can vary slightly depending on the method used.

5. Calculate IQR

IQR = Q3 - Q1 

= 42 - 8 = 34
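The manual steps above can be sketched in a few lines of Python. This is only an illustration of the median-of-halves approach with the median excluded for odd-sized data; the helper name `quartiles_exclusive` is made up for this example and is not taken from the linked notebook.

```python
from statistics import median

def quartiles_exclusive(values):
    """Return (Q1, Q2, Q3) using medians of the lower and upper halves.

    Hypothetical helper: for an odd number of points the overall
    median is left out of both halves, as in the manual steps.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower_half = data[:half]
    # Skip the middle element when n is odd
    upper_half = data[half + 1:] if n % 2 else data[half:]
    return median(lower_half), median(data), median(upper_half)

q1, q2, q3 = quartiles_exclusive([4, 8, 15, 16, 23, 42, 108])
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 8 16 42 34
```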

Calculating Quartiles with Interpolation, as in the `np.percentile` Function

To manually calculate quartiles with interpolation and get results similar to those from `np.percentile`, we need to adjust our method slightly. 

 Steps for Manual Calculation with Interpolation

1. Sort the Data (if not already sorted):

   - Dataset: `[4, 8, 15, 16, 23, 42, 108]` (already sorted)

2. Calculate the Positions for Q1 and Q3 with Interpolation:

   - With its default linear interpolation, `np.percentile` places the `p`-th percentile of a dataset of size `n` at the zero-based position `P = (n - 1) * p / 100`.

   - For Q1: `P = (7 - 1) * 25 / 100 = 1.5`. For Q3: `P = (7 - 1) * 75 / 100 = 4.5`.

3. Interpolate Between Data Points:

   - Neither position is a whole number, so each quartile lies between two neighboring values. Position 1.5 is halfway between the 2nd and 3rd values, so Q1 = (8 + 15)/2 = 11.5.

   - Similarly, position 4.5 is halfway between the 5th and 6th values, so Q3 = (23 + 42)/2 = 32.5.

4. Calculate IQR:

   - IQR = Q3 - Q1 = 32.5 - 11.5 = 21.0.
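To check these numbers without NumPy, here is a small pure-Python sketch of linear interpolation between neighboring values. It assumes the zero-based position `(n - 1) * p / 100`, which is how `np.percentile` behaves with its default linear method as far as I know; the function name `percentile_linear` is invented for this example.

```python
import math

def percentile_linear(sorted_data, p):
    """Percentile p (0-100) of already-sorted data, linearly interpolated.

    Hypothetical helper; assumes the zero-based position (n - 1) * p / 100.
    """
    n = len(sorted_data)
    pos = (n - 1) * p / 100
    lower = math.floor(pos)
    upper = min(lower + 1, n - 1)   # stay in range when p == 100
    frac = pos - lower              # how far we are past the lower neighbor
    return sorted_data[lower] + frac * (sorted_data[upper] - sorted_data[lower])

data = sorted([4, 8, 15, 16, 23, 42, 108])
q1 = percentile_linear(data, 25)
q3 = percentile_linear(data, 75)
iqr = q3 - q1
print(q1, q3, iqr)  # 11.5 32.5 21.0
```

If NumPy is installed, `np.percentile(data, [25, 75])` should give the same two quartiles for this dataset.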

Identifying Outliers

1. Find the lower bound (Q1 - 1.5 * IQR) and upper bound (Q3 + 1.5 * IQR).

lower bound = 11.5 - 1.5*21 = -20

upper bound = 32.5 + 1.5*21 = 64

2. Identify any numbers outside this range as outliers.

108 lies outside the boundary and so it is an outlier.
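The bound check can be written as a short function. This is a minimal sketch that takes the interpolated quartiles from above (Q1 = 11.5, Q3 = 32.5) as inputs; the name `iqr_outliers` and the parameter `k` are illustrative, with `k=1.5` being the multiplier used in this post.

```python
def iqr_outliers(values, q1, q3, k=1.5):
    """Return (lower bound, upper bound, outliers) for the IQR rule.

    Hypothetical helper: anything below q1 - k*IQR or above
    q3 + k*IQR is flagged as an outlier.
    """
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [x for x in values if x < lower or x > upper]
    return lower, upper, outliers

data = [4, 8, 15, 16, 23, 42, 108]
lower, upper, outliers = iqr_outliers(data, q1=11.5, q3=32.5)
print(lower, upper, outliers)  # -20.0 64.0 [108]
```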

Python Code:

Data-Science/outliers.ipynb at main · lovelynrose/Data-Science (github.com)
