Handling Missing Values

Missing values are a common challenge in machine learning and data analysis. They occur when data points are absent for specific variables in a dataset. These gaps can appear as blank cells, null values or special symbols like "NA", "NaN" or "unknown". If not addressed properly, missing values can harm the accuracy and reliability of our models: they reduce the sample size, introduce bias and make it difficult to apply analysis techniques that require complete data.

Importance of Handling Missing Values

Handling missing values well helps ensure that our analysis and machine learning models produce accurate and unbiased results. Key reasons include:

Improved Model Accuracy: Addressing missing values helps avoid incorrect predictions and boosts model performance.
Increased Statistical Power: Imputing or removing missing data allows the use of more analysis techniques, and imputation in particular maintains the sample size.

Missing values can introduce several challenges in data analysis, including:

Reduced sample size: If rows or data points with missing values are removed, the overall sample size shrinks, which may decrease the reliability and accuracy of the analysis.
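As a rough illustration of the points above, the sketch below counts missing values per column and shows how much the sample shrinks when incomplete rows are dropped. The toy data and the list of placeholder strings are assumptions made for this example only; a real dataset may use different markers.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a placeholder string, a real NaN and a blank cell.
raw = pd.DataFrame({
    "city": ["Houston", "unknown", "Miami", ""],
    "marks": [85.0, np.nan, 78.0, 90.0],
})

# Assumed list of placeholder strings that should count as missing.
sentinels = ["NA", "NaN", "unknown", ""]
clean = raw.replace(sentinels, np.nan)

# Number of missing entries in each column.
missing_per_column = clean.isna().sum()

# Sample size before and after dropping incomplete rows.
rows_before = len(clean)
rows_after = len(clean.dropna())

print(missing_per_column)
print(f"Rows before dropna: {rows_before}, after: {rows_after}")
```

Converting placeholder strings to NaN first matters because isna() and dropna() only recognise real missing values, not text such as "unknown".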
import pandas as pd
import numpy as np

data = {
    'School ID': [101, 102, 103, np.nan, 105, 106, 107, 108],
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva', 'Frank', 'Grace', 'Henry'],
    'Address': ['123 Main St', '456 Oak Ave', '789 Pine Ln', '101 Elm St', np.nan, '222 Maple Rd', '444 Cedar Blvd', '555 Birch Dr'],
    'City': ['Los Angeles', 'New York', 'Houston', 'Los Angeles', 'Miami', np.nan, 'Houston', 'New York'],
    'Subject': ['Math', 'English', 'Science', 'Math', 'History', 'Math', 'Science', 'English'],
    'Marks': [85, 92, 78, 89, np.nan, 95, 80, 88],
    'Rank': [2, 1, 4, 3, 8, 1, 5, 3],
    'Grade': ['B', 'A', 'C', 'B', 'D', 'A', 'C', 'B']
}
df = pd.DataFrame(data)
print("Sample DataFrame:")
print(df)

# Removing Rows with Missing Values
# Removing rows with missing values is a simple and straightforward way to handle missing data, used
# when we want to keep our analysis clean and minimize complexity.
# Here dropna() removes 3 of the 8 rows (index 3, 4 and 5), because each has at least one missing value.
df_cleaned = df.dropna()
print("\nDataFrame after removing rows with missing values:")
print(df_cleaned)

# Imputation Methods
# Imputation replaces missing values with estimated values. This approach is useful when we want
# to preserve the dataset’s sample size and avoid losing data points. However, the imputed values
# are only estimates and may not always be reliable.
mean_imputation = df['Marks'].fillna(df['Marks'].mean())
median_imputation = df['Marks'].fillna(df['Marks'].median())
mode_imputation = df['Marks'].fillna(df['Marks'].mode().iloc[0])
print("\nImputation using Mean:")
print(mean_imputation)
print("\nImputation using Median:")
print(median_imputation)
print("\nImputation using Mode:")
print(mode_imputation)

# Explanation: mode_imputation = df['Marks'].fillna(df['Marks'].mode().iloc[0])
# mode() scans the non-missing values in the column and returns the most frequent value(s).
# .iloc[0]: even if multiple modes exist, .iloc[0] selects only the first one (position 0).
# In this sample every mark is unique, so mode() returns all of them in sorted order and
# .iloc[0] picks the smallest value, 78.

# Forward and Backward Fill
# Forward and backward fill replace missing values with the nearest non-missing value from the
# same column: the previous one for forward fill, the next one for backward fill. This is useful
# when there’s an inherent order or sequence in the data.
# ffill() and bfill() are used here because fillna(method=...) is deprecated in recent pandas versions.
forward_fill = df['Marks'].ffill()
backward_fill = df['Marks'].bfill()
print("\nForward Fill:")
print(forward_fill)
print("\nBackward Fill:")
print(backward_fill)

# Interpolation Techniques
# Interpolation estimates missing values from the values of surrounding data points.
# Unlike simpler imputation methods (e.g. mean, median, mode), interpolation uses the relationship
# between neighboring values to make more informed estimates.
# Note: method='quadratic' requires SciPy to be installed.
linear_interpolation = df['Marks'].interpolate(method='linear')
quadratic_interpolation = df['Marks'].interpolate(method='quadratic')
print("\nLinear Interpolation:")
print(linear_interpolation)
print("\nQuadratic Interpolation:")
print(quadratic_interpolation)

Linear Interpolation fills the gaps by assuming a straight line between the two adjacent known data points. Quadratic Interpolation fits a curve (a quadratic polynomial) through the missing points, taking more neighboring values into account to produce a smoother, potentially more natural curve for the data trend.
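To make the differences between these strategies concrete, the sketch below applies several of them to the same 'Marks' values as in the example and compares what each one puts into the single gap at index 4. The variable names (marks, filled, gap, by_hand) are introduced here for illustration only, and quadratic interpolation is left out so that the block does not depend on SciPy.

```python
import numpy as np
import pandas as pd

# Same values as the 'Marks' column in the example; index 4 is missing.
marks = pd.Series([85, 92, 78, 89, np.nan, 95, 80, 88], name="Marks")

# One column per strategy, so the results can be compared side by side.
filled = pd.DataFrame({
    "mean": marks.fillna(marks.mean()),
    "median": marks.fillna(marks.median()),
    "mode": marks.fillna(marks.mode().iloc[0]),
    "ffill": marks.ffill(),
    "bfill": marks.bfill(),
    "linear": marks.interpolate(method="linear"),
})

# Keep only the row that was originally missing.
gap = filled.loc[marks.isna()]
print(gap.round(2))

# Linear interpolation by hand: the midpoint of the two neighbouring marks.
by_hand = (marks[3] + marks[5]) / 2
print(by_hand)
```

Forward fill copies the previous mark (89), backward fill copies the next one (95), and linear interpolation lands halfway between them (92). The mean, median and mode ignore position entirely, which is why they suit unordered data better than sequences.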