Data Preprocessing 4 - IQR
Solving a Small Dataset Manually
Let's consider a small dataset:
`[4, 8, 15, 16, 23, 42, 108]`
Using the IQR Method:
1. Arrange the data in ascending order: `[4, 8, 15, 16, 23, 42, 108]`
2. Calculate Q1 (25th percentile) and Q3 (75th percentile).
Calculating the first quartile (Q1) and the third quartile (Q3) manually involves a few steps, especially if the dataset is not already sorted.
Steps to Calculate Q1 and Q3 Manually:
1. Sort the Data (if not already sorted):
- Dataset: `[4, 8, 15, 16, 23, 42, 108]` (already sorted)
2. Find the Median (Second Quartile or Q2):
- This dataset has 7 elements.
- The median is the middle value, so for our dataset, it's the fourth number: `16`.
3. Find Q1 (First Quartile):
- Q1 is the median of the first half of the data (excluding the median if the number of data points is odd).
- First half of the data (excluding 16): `[4, 8, 15]`
- Since there are 3 numbers, the median (and hence Q1) is the middle number: `8`.
4. Find Q3 (Third Quartile):
- Q3 is the median of the second half of the data (excluding the median if the number of data points is odd).
- Second half of the data (excluding 16): `[23, 42, 108]`
- Since there are 3 numbers, the median (and hence Q3) is the middle number: `42`.
Therefore, for the dataset `[4, 8, 15, 16, 23, 42, 108]`, the first quartile (Q1) is `8`, and the third quartile (Q3) is `42`.
These calculations use the median-of-halves method, leaving the overall median out of both halves because the number of data points is odd (a convention often attributed to Moore and McCabe). Other conventions exist, such as Tukey's hinges, which include the median in both halves, and results can vary slightly depending on the method used.
5. Calculate the IQR:
IQR = Q3 - Q1 = 42 - 8 = 34
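The manual steps above can be sketched in Python. This is a minimal illustration, not the code from the linked notebook; the function name `quartiles_median_of_halves` is made up for this example, and it follows the convention used above of leaving the median out of both halves when the count is odd.

```python
from statistics import median

def quartiles_median_of_halves(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    The overall median is excluded from both halves when the
    number of data points is odd, as in the manual steps above.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]                                # values below the median
    upper = data[half + 1:] if n % 2 else data[half:]  # values above the median
    return median(lower), median(data), median(upper)

data = [4, 8, 15, 16, 23, 42, 108]
q1, q2, q3 = quartiles_median_of_halves(data)
iqr = q3 - q1
print(q1, q2, q3, iqr)  # 8 16 42 34
```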
Calculating Quartiles with Interpolation, as in `np.percentile`
To manually calculate quartiles with interpolation and get results similar to those from `np.percentile`, we need to adjust our method slightly.
Steps for Manual Calculation with Interpolation
1. Sort the Data (if not already sorted):
- Dataset: `[4, 8, 15, 16, 23, 42, 108]` (already sorted)
2. Calculate the Positions for Q1 and Q3 with Interpolation:
- The position `P` for a given percentile `p` in a dataset of size `n` is calculated using `P = (n - 1) * p / 100`, where positions are counted from 0. This is the default (linear) method of `np.percentile`.
- For Q1: `P = (7 - 1) * 25 / 100 = 1.5`.
- For Q3: `P = (7 - 1) * 75 / 100 = 4.5`.
3. Interpolate Between Data Points:
- Since the positions are not whole numbers, we interpolate between the two nearest values. A fractional part of 0.5 puts the result halfway between them, which is simply their average.
- For Q1 (position 1.5), we average the values at positions 1 and 2, i.e. the 2nd and 3rd values: (8 + 15)/2 = 11.5.
- Similarly, for Q3 (position 4.5), we average the values at positions 4 and 5, i.e. the 5th and 6th values: (23 + 42)/2 = 32.5.
4. Calculate IQR:
- IQR = Q3 - Q1 = 32.5 - 11.5 = 21.0.
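As a rough sketch of the interpolation steps, the snippet below computes a percentile by linear interpolation at the 0-based position `(n - 1) * p / 100`, which is what the default linear method of `np.percentile` does. The helper name `percentile_linear` is hypothetical and written in plain Python so each step stays visible.

```python
import math

def percentile_linear(values, p):
    """Percentile p (0-100) by linear interpolation between neighbours."""
    data = sorted(values)
    pos = (len(data) - 1) * p / 100  # 0-based, possibly fractional position
    lo = math.floor(pos)             # index of the value just below
    hi = math.ceil(pos)              # index of the value just above
    frac = pos - lo                  # how far we are between the two
    return data[lo] + frac * (data[hi] - data[lo])

data = [4, 8, 15, 16, 23, 42, 108]
q1 = percentile_linear(data, 25)
q3 = percentile_linear(data, 75)
iqr = q3 - q1
print(q1, q3, iqr)  # 11.5 32.5 21.0
```

Note that these values differ from the median-of-halves results (Q1 = 8, Q3 = 42), which is why it matters to state which quartile method is being used.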
Identifying Outliers
1. Find the lower bound (Q1 - 1.5 * IQR) and upper bound (Q3 + 1.5 * IQR).
lower bound = 11.5 - 1.5*21 = -20
upper bound = 32.5 + 1.5*21 = 64
2. Identify any numbers outside this range as outliers.
108 is greater than the upper bound of 64, so it is an outlier. No value falls below the lower bound of -20.
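The bounds check can be reproduced with NumPy. The following is a minimal sketch that assumes NumPy is installed and relies on the default linear interpolation of `np.percentile`; the variable names are illustrative and not taken from the linked notebook.

```python
import numpy as np

data = np.array([4, 8, 15, 16, 23, 42, 108])

# Quartiles with NumPy's default (linear) interpolation
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Standard 1.5 * IQR fences
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Keep only the values that fall outside the fences
outliers = data[(data < lower_bound) | (data > upper_bound)]

print(lower_bound, upper_bound)  # -20.0 64.0
print(outliers)                  # [108]
```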
Python Code:
Data-Science/outliers.ipynb at main · lovelynrose/Data-Science (github.com)