Data Preprocessing 2 - Data Imputation
Data imputation is a process used in statistics and data science to handle missing values in datasets. When a dataset has missing or null values, it can lead to inaccurate results during analysis or model training. To address this, data imputation techniques are used to estimate and fill in these missing values.
When to do it
Use it only when roughly 5-15% of a column's values are missing; remember that imputed values are only estimates, not the true data.
Ensure that the statistical properties of the data are maintained even after imputation.
Consider the following column in a dataset:
[5, nan, nan, 8, 10]
The key approaches and methods in data imputation include:
1. Mean Imputation:
Replacing the missing values with the mean value of the entire feature (column). This is best used when the data distribution is normal and the missing values are not too many.
Eg. Taking the mean of the observed values:
(5 + 8 + 10) / 3 ≈ 7.67
Result : [5, 7.67, 7.67, 8, 10]
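As a minimal sketch with pandas (the column name "x" is just illustrative):

import pandas as pd
import numpy as np

# Example column from above, with missing values
df = pd.DataFrame({"x": [5, np.nan, np.nan, 8, 10]})

# Mean imputation: replace NaNs with the mean of the observed values, (5 + 8 + 10) / 3 ≈ 7.67
df["x"] = df["x"].fillna(df["x"].mean())
print(df["x"].tolist())  # [5.0, 7.67..., 7.67..., 8.0, 10.0]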
2. Median Imputation:
Similar to mean imputation, but using the median value. This is more robust to outliers compared to mean imputation.
Eg. Median of 5, 8, 10 is 8
Result : [5, 8, 8, 8, 10]
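A minimal sketch using scikit-learn's SimpleImputer (the column is reshaped to 2-D because the API expects it):

import numpy as np
from sklearn.impute import SimpleImputer

data = np.array([[5], [np.nan], [np.nan], [8], [10]])

# Median imputation: the median of the observed values 5, 8, 10 is 8
imputer = SimpleImputer(strategy="median")
print(imputer.fit_transform(data).ravel())  # [ 5.  8.  8.  8. 10.]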
3. Mode Imputation:
For categorical data, missing values can be replaced with the mode (the most frequently occurring value in the dataset).
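A minimal sketch with pandas (the "colour" column and its values are hypothetical):

import pandas as pd
import numpy as np

colour = pd.Series(["red", "blue", np.nan, "red", np.nan])

# Mode imputation: fill the gaps with the most frequent category ("red")
print(colour.fillna(colour.mode()[0]).tolist())  # ['red', 'blue', 'red', 'red', 'red']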
4. Constant Value Imputation:
Replacing missing values with a constant value. This is typically used when missing values have a certain meaning, like zero in some contexts.
Eg. Let constant = -1
Result = [5, -1, -1, 8, 10]
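A minimal sketch with pandas, using -1 as the constant:

import pandas as pd
import numpy as np

s = pd.Series([5, np.nan, np.nan, 8, 10])

# Constant imputation: fill NaNs with the sentinel value -1
print(s.fillna(-1).tolist())  # [5.0, -1.0, -1.0, 8.0, 10.0]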
5. Forward Fill or Backward Fill:
In time series data, missing values can be filled using the previous observed value (forward fill) or the next observed value (backward fill).
Eg. Forward fill : Last non-missing value before nan is 5
Result : [5, 5, 5, 8, 10]
Backward Fill : Next non-missing value after nan is 8.
Result: [5, 8, 8, 8, 10]
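A minimal sketch with pandas:

import pandas as pd
import numpy as np

s = pd.Series([5, np.nan, np.nan, 8, 10])

# Forward fill propagates the last observed value; backward fill uses the next one
print(s.ffill().tolist())  # [5.0, 5.0, 5.0, 8.0, 10.0]
print(s.bfill().tolist())  # [5.0, 8.0, 8.0, 8.0, 10.0]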
6. Linear Interpolation:
In time series, linear interpolation replaces missing values by fitting a straight line between the previous and next observed points.
Eg. The known points surrounding the NaNs are 5 (position 1) and 8 (position 4).
Calculate the slope between these points:
y changes from 5 to 8, i.e. by 3
x takes 3 steps to go from position 1 to position 4
Slope = change in y / change in x = 3/3 = 1
Apply the slope one step at a time: 5 + 1 = 6, then 6 + 1 = 7
Result : [5, 6, 7, 8, 10]
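A minimal sketch with pandas, which performs the same slope calculation internally:

import pandas as pd
import numpy as np

s = pd.Series([5, np.nan, np.nan, 8, 10])

# Linear interpolation between the surrounding observed points (5 and 8)
print(s.interpolate(method="linear").tolist())  # [5.0, 6.0, 7.0, 8.0, 10.0]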
7. K-Nearest Neighbors (KNN) Imputation:
Utilizes the K-Nearest Neighbors algorithm to estimate missing values from similar data points, typically by averaging the corresponding values of the k most similar complete rows. Although computationally expensive, it is often preferred over single-value methods; typical choices of k are small odd numbers such as 3 or 5.
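A minimal sketch using scikit-learn's KNNImputer (the second feature and its values are made up so the imputer has something to measure similarity on):

import numpy as np
from sklearn.impute import KNNImputer

X = np.array([
    [5.0, 1.0],
    [np.nan, 1.1],
    [np.nan, 3.0],
    [8.0, 2.9],
    [10.0, 3.2],
])

# Each missing value is replaced by the average of that feature
# over the k rows most similar on the other feature(s)
imputer = KNNImputer(n_neighbors=3)
print(imputer.fit_transform(X))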
8. Regression Imputation:
Missing values are estimated using a regression model: each feature with missing data is regressed on the other features in the dataset, and the model's predictions fill in the gaps. Like KNN imputation, it exploits relationships between features and usually gives better estimates than single-value methods.
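A minimal sketch using scikit-learn's IterativeImputer, which by default regresses each incomplete feature on the others (the two-feature data is the same made-up example as above):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([
    [5.0, 1.0],
    [np.nan, 1.1],
    [np.nan, 3.0],
    [8.0, 2.9],
    [10.0, 3.2],
])

# Each feature with missing values is regressed on the remaining features
# and the model's predictions fill in the gaps
imputer = IterativeImputer(random_state=0)
print(imputer.fit_transform(X))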
9. Multiple Imputation:
A more sophisticated approach where multiple imputations are done, creating several complete datasets. Statistical analysis is then performed on all these datasets, and the results are combined.
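One way to approximate this with scikit-learn is to run IterativeImputer several times with posterior sampling, producing several completed datasets to analyse and pool (a sketch, not a full multiple-imputation workflow):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

X = np.array([
    [5.0, 1.0],
    [np.nan, 1.1],
    [np.nan, 3.0],
    [8.0, 2.9],
    [10.0, 3.2],
])

# Sampling from the posterior gives a different plausible completion each run;
# the analysis is then repeated on every completed dataset and the results pooled
completed = [
    IterativeImputer(sample_posterior=True, random_state=seed).fit_transform(X)
    for seed in range(5)
]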
10. Hot Deck Imputation:
Involves replacing a missing value with a value drawn at random from similar records (donors) in the same dataset.
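A minimal hot-deck style sketch with pandas, drawing donors at random from records in the same group (the "region" and "income" columns are hypothetical):

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

df = pd.DataFrame({
    "region": ["N", "N", "N", "S", "S"],
    "income": [30, np.nan, 34, np.nan, 55],
})

def hot_deck(group):
    donors = group.dropna()
    # Replace each missing value with a randomly chosen donor value from the same region
    return group.apply(lambda v: rng.choice(donors.values) if pd.isna(v) else v)

df["income"] = df.groupby("region")["income"].transform(hot_deck)
print(df["income"].tolist())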
11. Cold Deck Imputation:
Similar to hot deck, but the donor value is not chosen at random; it is taken from a record with similar values on the other variables, often from an external source such as a previous study.
The choice of imputation technique depends on the nature of the data, the extent of the missingness, and the type of analysis or modeling to be performed. It's important to note that imputation introduces some level of bias and should be carefully considered and documented in any analysis.
12. Mode Imputation (categorical variables):
Mode imputation can distort categorical data. Eg. if every missing gender value is replaced with 'F' simply because 'F' is the most frequent category, records that were actually 'M' are mislabelled. KNN-based imputation, which considers similar records, is often a better choice.
Python Code :
Data-Science/data imputation.ipynb at main · lovelynrose/Data-Science (github.com)