Novel MNAR Multivariate mechanism
This example generates artificial missing data with the mdatagen library, using the Breast Cancer Wisconsin dataset from scikit-learn. Missing values are introduced into the features under the Missing Not at Random (MNAR) mechanism, with a simulated missing rate of 20%. The values to remove are chosen with the Missingness Based on Own and Unobserved Values (MBOUV) method.
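To make the idea of MNAR concrete, here is a minimal sketch in plain pandas of one simple MNAR pattern, in which a value goes missing because of its own magnitude. This is not mdatagen's MBOUV implementation; the helper name `mask_highest_values` and the toy data are hypothetical and serve only to illustrate the concept.

```python
import numpy as np
import pandas as pd


def mask_highest_values(df: pd.DataFrame, column: str, missing_rate: float) -> pd.DataFrame:
    """Hypothetical helper: remove the largest values of one column.

    Whether a value is missing depends on the value itself, which is
    the defining property of MNAR.
    """
    out = df.copy()
    # Number of rows to mask, given a rate expressed in percent
    n_missing = int(round(len(out) * missing_rate / 100))
    # Row labels of the n_missing largest values in the column
    idx = out[column].nlargest(n_missing).index
    out.loc[idx, column] = np.nan
    return out


# Toy data, invented for illustration
toy = pd.DataFrame({
    "a": [1.0, 5.0, 3.0, 9.0, 7.0],
    "b": [10.0, 20.0, 30.0, 40.0, 50.0],
})
masked = mask_highest_values(toy, "a", missing_rate=40)
print(masked)
```

In this sketch the two largest values of `a` (9.0 and 7.0) are removed, so the observed part of the column is biased towards small values. That bias is what makes MNAR harder to handle than missingness that is unrelated to the data.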
In [1]:
# Import the libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from mdatagen.multivariate.mMNAR import mMNAR
# Load the data
wiscosin = load_breast_cancer()
wiscosin_df = pd.DataFrame(data=wiscosin.data, columns=wiscosin.feature_names)
X = wiscosin_df.copy() # Features
y = wiscosin.target # Label values
# Create an mMNAR generator for the dataset
generator = mMNAR(X=X, y=y)
# Generate missing data under the MNAR mechanism with MBOUV at a 20% missing rate
generate_data = generator.MBOUV(missing_rate=20, depend_on_external=X.columns)
# Count the missing values in each column
print(generate_data.isna().sum())
mean radius                166
mean texture                61
mean perimeter              35
mean area                  148
mean smoothness            156
mean compactness           161
mean concavity             112
mean concave points        171
mean symmetry               93
mean fractal dimension     110
radius error                74
texture error              160
perimeter error             82
area error                  63
smoothness error           160
compactness error          152
concavity error             99
concave points error       143
symmetry error             171
fractal dimension error      1
worst radius               170
worst texture              159
worst perimeter              4
worst area                 170
worst smoothness            51
worst compactness          148
worst concavity             42
worst concave points       126
worst symmetry              78
worst fractal dimension    148
target                       0
dtype: int64