This dataset is a transformed version of the sickle cell disease (SCD) data from Al-Dhamari et al., "Synthetic datasets for open software development in rare disease research", Orphanet J Rare Dis 19, 265 (2024). We retained the subset of columns relevant to our model and turned the data into a representative cohort: SCD patients are kept at an expected prevalence of 0.3%, and the remaining records are converted to non-SCD patients by distributing their biomarker values around a healthy value. The retained columns are described below.
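As a rough illustration of the transformation described above, the following Python sketch keeps a 0.3% share of the source records as SCD and redraws the biomarkers of the rest around a healthy value. The function name, the healthy centres and spreads, and the choice of a Gaussian distribution are all assumptions made for this example; the description does not state which distribution or values were actually used.

```python
import random

SCD_PREVALENCE = 0.003  # 0.3%, as stated in the description

# Hypothetical healthy centres and spreads, for illustration only.
HEALTHY_CBC, CBC_SD = 14.0, 1.0   # g/dL
HEALTHY_RC, RC_SD = 1.0, 0.3      # % reticulocytes


def to_representative_cohort(records, prevalence=SCD_PREVALENCE, seed=0):
    """Keep round(prevalence * n) records as SCD; redraw biomarkers for the rest."""
    rng = random.Random(seed)
    n_scd = round(prevalence * len(records))
    scd_idx = set(rng.sample(range(len(records)), n_scd))

    cohort = []
    for i, rec in enumerate(records):
        row = dict(rec)  # copy so the source records stay untouched
        if i in scd_idx:
            row["SCD"] = 1  # biomarkers of retained SCD patients are kept
        else:
            row["SCD"] = 0
            # Assumed Gaussian spread around the healthy value.
            row["CBC"] = rng.gauss(HEALTHY_CBC, CBC_SD)
            row["RC"] = max(0.0, rng.gauss(HEALTHY_RC, RC_SD))
        cohort.append(row)
    return cohort


# Toy stand-in for the source SCD records (values are made up).
source = [{"CBC": 8.5, "RC": 9.0} for _ in range(10_000)]
cohort = to_representative_cohort(source)
n_scd = sum(row["SCD"] for row in cohort)
print(n_scd, len(cohort))
```

Fixing the seed makes the toy cohort reproducible; the number of SCD rows depends only on the prevalence and the cohort size, not on the seed.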
Format
scd_cohort
A data frame with 100,403 rows and 9 columns:
- age
Patient age
- sex
Patient sex; only male and female are assumed
- race
Patient race. One of "Others", "African-American", "European-American"
- birthDate
Patient birth date
- diagDate
Patient diagnosis date
- CBC
Complete Blood Count (CBC) biomarker test result, in g/dL
- RC
Reticulocyte Count (RC) biomarker test result, in % reticulocytes
- highrisk
Flag for high-risk ethnicity
- SCD
Flag indicating SCD observations, used to test model performance
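The column list above can be turned into a simple consistency check. The Python sketch below assumes the rows have been exported as dictionaries keyed by column name and that the SCD flag is truthy for SCD observations; the helper names `check_rows` and `scd_prevalence` and the sample rows are hypothetical, and the date format shown is only a guess.

```python
EXPECTED_COLUMNS = (
    "age", "sex", "race", "birthDate", "diagDate",
    "CBC", "RC", "highrisk", "SCD",
)
RACE_LEVELS = {"Others", "African-American", "European-American"}


def check_rows(rows):
    """Return a list of human-readable problems; empty if none were found."""
    problems = []
    for i, row in enumerate(rows):
        missing = [c for c in EXPECTED_COLUMNS if c not in row]
        if missing:
            problems.append(f"row {i}: missing columns {missing}")
        if row.get("race") not in RACE_LEVELS:
            problems.append(f"row {i}: unexpected race {row.get('race')!r}")
    return problems


def scd_prevalence(rows):
    """Share of rows whose SCD flag is set (assumes a truthy flag)."""
    return sum(1 for row in rows if row["SCD"]) / len(rows)


# Number of SCD rows implied by a 0.3% prevalence in 100,403 rows.
expected_scd_rows = round(0.003 * 100_403)

# Made-up sample rows, for illustration only.
rows = [
    {"age": 34, "sex": "Female", "race": "African-American",
     "birthDate": "1990-01-15", "diagDate": "2024-02-01",
     "CBC": 8.9, "RC": 7.5, "highrisk": 1, "SCD": 1},
    {"age": 51, "sex": "Male", "race": "Others",
     "birthDate": "1973-06-30", "diagDate": "2024-03-12",
     "CBC": 14.2, "RC": 1.1, "highrisk": 0, "SCD": 0},
]
print(check_rows(rows), scd_prevalence(rows), expected_scd_rows)
```

Run against the full data, `scd_prevalence` should come out close to 0.003 if the cohort matches the stated prevalence.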
Source
Al-Dhamari et al. (2024), doi:10.1186/s13023-024-03254-2.