This dataset is a transformed version of the sickle cell disease (SCD) data from Al-Dhamari et al., "Synthetic datasets for open software development in rare disease research", Orphanet J Rare Dis 19, 265 (2024). We retained the subset of columns relevant to our model and turned the data into a representative cohort: SCD patients are kept at an expected prevalence of 0.3%, and the remaining records are converted to non-SCD patients by distributing their biomarker values around a healthy value. The columns are described below.
scd_cohort
A data frame with 100,403 rows and 9 columns:
Patient Age
Patient gender; only Male and Female are represented
Patient race. One of "Others", "African-American", "European-American"
Patient birth date
Patient diagnosis date
Complete Blood Count biomarker test in g/dL
Reticulocyte count biomarker test, in % reticulocytes
Flag for high-risk ethnicity
Flag indicating SCD observations, used to test model performance
Al-Dhamari et al. (2024). doi:10.1186/s13023-024-03254-2.
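The cohort construction described above (keep SCD at the expected prevalence, spread the remaining biomarker values around a healthy value) can be illustrated with a toy sketch. This is not the code used to build `scd_cohort`. The column names `scd_flag` and `cbc`, the healthy mean and spread, and the choice of a normal distribution are all assumptions made for illustration; only the row count and the 0.3% prevalence come from the description.

```r
# Minimal sketch of a cohort at the stated prevalence.
# Column names, healthy mean/sd, and the normal distribution are illustrative
# assumptions, not values or methods taken from the source data.
set.seed(42)

n_rows     <- 100403                        # rows reported for scd_cohort
prevalence <- 0.003                         # expected SCD prevalence (0.3%)
n_scd      <- round(n_rows * prevalence)    # 301 rows kept as SCD

healthy_mean <- 14   # assumed healthy biomarker value in g/dL (hypothetical)
healthy_sd   <- 1    # assumed spread around that value (hypothetical)

toy_cohort <- data.frame(
  scd_flag = rep(c(1L, 0L), times = c(n_scd, n_rows - n_scd)),
  cbc      = NA_real_   # SCD rows would keep their original values; left NA here
)

# Non-SCD rows get biomarker values distributed around the healthy value.
is_healthy <- toy_cohort$scd_flag == 0L
toy_cohort$cbc[is_healthy] <- rnorm(sum(is_healthy), healthy_mean, healthy_sd)

# Observed prevalence in the toy cohort; should sit close to 0.003.
mean(toy_cohort$scd_flag)
```

With the real data, the same check would be the mean of the SCD flag column in `scd_cohort`, using whatever name that column actually has.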