Complete Pipeline Example
This Jupyter Notebook provides a complete example of the classical experimental setup for missing data studies. The four main steps are (Santos et al., 2019):
- Data Collection: We use the Breast Cancer Wisconsin dataset from scikit-learn, which is complete (i.e., it has no missing values).
- Missing Data Generation: We generate artificial missing data under the MNAR (Missing Not At Random) mechanism.
- Imputation: We impute the missing values with Multiple Imputation by Chained Equations (MICE).
- Evaluation: We evaluate the imputation quality with the Mean Squared Error (MSE).
Import the libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from mdatagen.multivariate.mMNAR import mMNAR
from mdatagen.metrics import EvaluateImputation
from mdatagen.plots import PlotMissingData
Step 1: Data Collection
# Load the data
wiscosin = load_breast_cancer()
wiscosin_df = pd.DataFrame(data=wiscosin.data, columns=wiscosin.feature_names)
X = wiscosin_df.copy() # Features
y = wiscosin.target # Label values
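The claim that this dataset is complete can be checked directly. The following is a minimal sketch that reloads the data under separate, illustrative names (`data`, `features`, `n_missing` are not part of the notebook) and counts the missing cells.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Reload the dataset under illustrative names (not the notebook's variables)
data = load_breast_cancer()
features = pd.DataFrame(data=data.data, columns=data.feature_names)

# Count missing cells across the whole feature table
n_missing = int(features.isna().sum().sum())

print(features.shape)
print(n_missing)
```

A count of zero confirms that any missing values seen later in the pipeline were introduced artificially in Step 2.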
Step 2: Missing Data Generation
# Create a generator instance for the MNAR mechanism
generator = mMNAR(X=X, y=y)
# Generate the missing data under the MNAR mechanism with a 20% missing rate
generate_MDdata = generator.random(missing_rate=20,
                                   deterministic=True)
# Visualize the missingness
miss_plot = PlotMissingData(data_missing=generate_MDdata,
                            data_original=wiscosin_df)
miss_plot.visualize_miss("normal")
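To make the MNAR mechanism concrete: a value is MNAR when the probability that it is missing depends on the unobserved value itself. The sketch below is not how `mMNAR` is implemented. It is a minimal illustration on a toy column, using a hypothetical helper (`make_mnar_lowest`) that hides the lowest 20% of the values of a single feature.

```python
import numpy as np
import pandas as pd


def make_mnar_lowest(series: pd.Series, missing_rate: float) -> pd.Series:
    """Hide the lowest `missing_rate` percent of the values in one column.

    Hypothetical helper for illustration only; not part of mdatagen.
    """
    n_missing = int(round(len(series) * missing_rate / 100))
    # Row labels of the n_missing smallest values
    idx = series.nsmallest(n_missing).index
    out = series.astype(float).copy()
    out.loc[idx] = np.nan
    return out


toy = pd.Series([5.0, 1.0, 9.0, 3.0, 7.0, 2.0, 8.0, 4.0, 6.0, 10.0])
toy_mnar = make_mnar_lowest(toy, missing_rate=20)

print(toy_mnar.isna().sum())       # number of hidden values
print(toy.mean(), toy_mnar.mean())  # true mean vs. mean of the observed values
```

Because the hidden values are exactly the smallest ones, the mean of the observed values overestimates the true mean. This kind of bias is what makes MNAR the hardest mechanism for imputation methods.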
Step 3: Imputation
# Initialize the MICE imputer
imputer = IterativeImputer(max_iter=100)
# Fit the imputer
imputer.fit(generate_MDdata)
col = X.columns.to_list()  # Column names for the result DataFrame
col.append("target")
df_imputate = pd.DataFrame(
    imputer.transform(generate_MDdata), columns=pd.Index(col)
)
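As a self-contained illustration of what `IterativeImputer` does, the sketch below imputes a small synthetic array (an assumption for illustration, not the notebook's data). Each feature with missing values is modeled as a function of the other features, in round-robin fashion. With its default settings, scikit-learn returns a single imputation rather than several; the names `toy_full`, `toy_missing` and `toy_imputer` are hypothetical.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
x0 = rng.normal(size=50)
# The second column is a noisy linear function of the first
toy_full = np.column_stack([x0, 2.0 * x0 + rng.normal(scale=0.1, size=50)])

toy_missing = toy_full.copy()
toy_missing[:10, 1] = np.nan  # hide ten values in the second column

toy_imputer = IterativeImputer(max_iter=10, random_state=0)
toy_imputed = toy_imputer.fit_transform(toy_missing)

print(np.isnan(toy_imputed).sum())  # remaining missing values
```

Observed cells are passed through unchanged. Only the cells that were `NaN` are filled with model predictions.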
Step 4: Evaluation
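# Drop the label column so the imputed and original frames have the same columns
eval_metric = EvaluateImputation(data_imputed=df_imputate.drop("target", axis=1),
                                 data_original=X,
                                 metric="mean_squared_error")
eval_metric.show()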
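The MSE is the mean of the squared differences between the original and the imputed values. The sketch below computes it with NumPy on toy arrays (hypothetical values), once over all cells and once restricted to the cells that were missing. Which convention `EvaluateImputation` follows is not stated in this notebook, so both variants are shown only as illustrations.

```python
import numpy as np

original = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
with_missing = np.array([[1.0, np.nan], [3.0, 4.0], [np.nan, 6.0]])
imputed = np.array([[1.0, 3.0], [3.0, 4.0], [4.0, 6.0]])

# MSE over every cell (observed cells contribute zero error)
mse_all = np.mean((original - imputed) ** 2)

# MSE restricted to the cells that were missing before imputation
mask = np.isnan(with_missing)
mse_missing = np.mean((original[mask] - imputed[mask]) ** 2)

print(mse_all, mse_missing)
```

Averaging over all cells dilutes the error by the share of observed values, so the two conventions are not directly comparable across different missing rates.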