This dataset is a transformed version of the sickle cell disease (SCD) data from Al-Dhamari et al., Synthetic datasets for open software development in rare disease research, Orphanet J Rare Dis 19, 265 (2024). We retained the subset of columns relevant to our model and turned the data into a representative cohort: SCD patients are kept at an expected prevalence of 0.3%, and the remaining records are converted to non-SCD patients by distributing their biomarker values around a healthy value. The columns are described below.
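The transformation described above can be illustrated with a minimal sketch. It is not the code used to build `scd_cohort`: the toy cohort size, the random selection of SCD rows, the 0/1 coding of the flag, and every biomarker mean and standard deviation are assumptions chosen for illustration. Only the 0.3% prevalence comes from the description.

```r
# Minimal sketch of the cohort transformation; not the original pipeline.
# Every numeric setting except the 0.3% prevalence is an illustrative assumption.
set.seed(42)

n_total    <- 10000   # toy cohort size (the real data has 100,403 rows)
prevalence <- 0.003   # expected SCD prevalence stated in the description

# Hypothetical source data in which every record has SCD-like biomarkers.
src <- data.frame(
  CBC = rnorm(n_total, mean = 8,  sd = 1),   # assumed SCD-like value, g/dL
  RC  = rnorm(n_total, mean = 10, sd = 2)    # assumed SCD-like value, %
)

# Keep a random 0.3% of rows as SCD patients (random selection is an assumption).
n_scd   <- round(n_total * prevalence)
scd_idx <- sample.int(n_total, n_scd)
src$SCD <- 0L
src$SCD[scd_idx] <- 1L

# Convert the remaining rows to non-SCD patients by redrawing their
# biomarkers around assumed healthy values.
healthy <- src$SCD == 0L
src$CBC[healthy] <- rnorm(sum(healthy), mean = 14, sd = 1)    # assumed healthy, g/dL
src$RC[healthy]  <- rnorm(sum(healthy), mean = 1,  sd = 0.3)  # assumed healthy, %

mean(src$SCD)  # share of SCD rows in the toy cohort
```

Redrawing only the non-SCD rows leaves the retained SCD records untouched, which is one plausible reading of "distributing the biomarker values around a healthy value".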

scd_cohort

Format

scd_cohort

A data frame with 100,403 rows and 9 columns:

age

Patient age

sex

Patient sex; only Male and Female are represented

race

Patient race. One of "Others", "African-American", "European-American"

birthDate

Patient birth date

diagDate

Patient diagnosis date

CBC

Complete blood count (CBC) biomarker value, in g/dL

RC

Reticulocyte count (RC) biomarker value, as a percentage of reticulocytes

highrisk

Flag for high-risk ethnicity

SCD

Flag indicating SCD observations, used as the reference label when testing model performance
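As a quick check of the layout listed above, the sketch below verifies the nine column names and computes the share of SCD rows. The helper `scd_prevalence()` and the `toy` data frame are hypothetical stand-ins, not part of the package. The 0/1 coding of `highrisk` and `SCD`, and the missing `diagDate` for non-SCD rows, are assumptions, since the coding is not documented here.

```r
# Column names taken from the Format section.
expected_cols <- c("age", "sex", "race", "birthDate", "diagDate",
                   "CBC", "RC", "highrisk", "SCD")

# Hypothetical helper: checks the columns and returns the SCD prevalence.
# Assumes SCD is coded 0/1 (or FALSE/TRUE); the actual coding may differ.
scd_prevalence <- function(df) {
  missing_cols <- setdiff(expected_cols, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  mean(as.integer(df$SCD))
}

# Toy stand-in with the documented layout (values are made up).
toy <- data.frame(
  age       = c(34, 7, 52, 19),
  sex       = c("Male", "Female", "Female", "Male"),
  race      = c("Others", "African-American",
                "European-American", "African-American"),
  birthDate = as.Date(c("1990-01-15", "2017-06-02", "1972-03-30", "2005-11-11")),
  diagDate  = as.Date(c(NA, "2018-01-10", NA, NA)),
  CBC       = c(14.1, 8.2, 13.6, 14.8),
  RC        = c(1.0, 9.5, 1.2, 0.9),
  highrisk  = c(0L, 1L, 0L, 1L),
  SCD       = c(0L, 1L, 0L, 0L)
)

scd_prevalence(toy)
```

Passing `scd_cohort` instead of `toy` should return a value close to the stated 0.3% prevalence, provided the flag is coded as assumed.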

Source

Al-Dhamari (2024) doi:10.1186/s13023-024-03254-2.