RA2-DREAM Challenge: Automated Scoring of Radiographic Damage in Rheumatoid Arthritis

Rheumatoid arthritis (RA) is a common chronic autoimmune disease characterized by inflammation of the synovium leading to joint space narrowing and bony erosions around the joints. The current state-of-the-art method for quantifying the degree of joint damage is human visual inspection of radiographic images by highly trained readers. This tedious, expensive, and non-scalable method is an impediment to research on factors associated with RA joint damage and its progression, and may delay appropriate treatment decisions by clinicians. We sought to develop automatic, rapid, accurate methods to quantify the degree of joint damage in patients with RA using machine learning or deep learning through the community crowdsourced RA2-DREAM Challenge. The motivation for the Challenge, background related to the scoring of joint damage in RA, and the scored radiographic images from clinical studies that supported the Challenge will be described. In addition, each of the three sub-challenges will be discussed: (1) predict overall RA damage from radiographic images of hands and feet; (2) predict joint space narrowing scores from radiographic images of hands and feet; and (3) predict joint erosion scores from radiographic images of hands and feet.

A framework for studying machine learning methods in healthcare: The First EHR DREAM Challenge

Implementation of machine learning-based methods in healthcare is of high interest and has the potential to positively impact patient care. However, real-world accuracy and outcomes from the application of these methods remain largely unknown, and performance on different subpopulations of patients also remains unclear. In order to address these important questions, we hosted a community challenge to evaluate disparate methods that predict healthcare outcomes. We focused on the prediction of all-cause mortality as it is quantitative and clinically unambiguous. In order to overcome patient privacy concerns, we employed a Model-to-Data approach, allowing citizen scientists and researchers to train and evaluate machine learning models on electronic health records from the University of Washington medical system. We held the EHR DREAM Challenge: Patient Mortality from May 2019 to April 2020. We asked participants to predict the 180-day mortality status from the last visit that each patient had in UW Medicine. In total, we had 354 registered participants, coalescing into 25 independent teams. The top-performing team achieved an area under the receiver operating characteristic curve of 0.947 (95% CI 0.942, 0.951) and an area under the precision-recall curve of 0.487 on all patients over a one-year observation of a large health system. In a follow-up phase of the challenge, we extracted the trained features from the best performing methods and evaluated the generalizability of models across different patient populations, revealing that models differ in accuracy on subpopulations, such as race or gender, even when they are trained on the same data and have similar accuracy on the population. This is the broadest community challenge focused on the evaluation of state-of-the-art machine learning methods in healthcare performed to date and shows the importance of prospective evaluation and collaborative development of individual models.
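
The subpopulation comparison described above can be illustrated with a minimal sketch. It assumes binary mortality labels, one risk score per patient, and a group label per patient; the function names and the toy data are hypothetical and are not taken from the Challenge infrastructure. The area under the ROC curve is computed directly as the probability that a randomly chosen positive case is scored above a randomly chosen negative case.

```python
from collections import defaultdict

def auroc(labels, scores):
    """AUROC as P(score of a positive > score of a negative); ties count 0.5.

    Quadratic in the number of patients, which is fine for a small sketch.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auroc_by_group(labels, scores, groups):
    """Compute AUROC separately within each subpopulation."""
    by_group = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        by_group[g][0].append(y)
        by_group[g][1].append(s)
    return {g: auroc(ys, ss) for g, (ys, ss) in by_group.items()}

# Toy data (made up for illustration only)
labels = [1, 0, 1, 0, 1, 0, 1, 0]
scores = [0.9, 0.1, 0.8, 0.2, 0.4, 0.6, 0.7, 0.3]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

overall = auroc(labels, scores)                      # pooled over all patients
per_group = auroc_by_group(labels, scores, groups)   # one value per group
```

In this toy example the pooled value hides a gap between the two groups, which is the kind of difference a subpopulation analysis is meant to surface.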

Parkinson's disease symptom assessment in free-living conditions; the BEAT-PD Challenge

Recent advances in mobile health have demonstrated great potential to leverage sensor-based technologies for quantitative, remote monitoring of health and disease, particularly for diseases affecting motor function such as Parkinson’s disease. While infrequent doctor’s visits along with patient recall can be subject to bias, remote monitoring offers the promise of a more objective, holistic picture of the symptoms and complications experienced by patients on a daily basis, which is critical for making decisions about treatment.

Previous work, including the 2017 Parkinson’s Disease Digital Biomarker DREAM Challenge, showed that Parkinson’s diagnosis and symptom severity can be predicted using wearable and consumer sensors worn during the completion of specific short tasks. The BEAT-PD Challenge sought to understand whether symptom severity could be predicted from passive monitoring of patients as they went about their daily lives, a critical step toward developing algorithms for remote monitoring. To this end, we leveraged two previously unavailable data sets which collected passive accelerometer data from wrist-worn devices coupled with patient self-reports of symptom severity. Participants were asked to build patient-specific models to predict on/off medication status (subchallenge 1), dyskinesia, an often-violent involuntary movement which arises as a side-effect of medication (subchallenge 2), and tremor (subchallenge 3) for 28 patients. The participant models were compared to a patient-specific null model.
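
A patient-specific null model can be sketched as follows. The abstract does not define the null model, so this sketch assumes the simplest choice: predict, for every recording of a patient, the mean of that patient's training labels. The function names are hypothetical, and mean squared error is used only for illustration rather than as the Challenge metric.

```python
from collections import defaultdict
from statistics import mean

def fit_null_model(train_rows):
    """train_rows: iterable of (patient_id, label) pairs.

    Assumption: the null prediction for a patient is the mean of that
    patient's own training labels, ignoring all sensor data.
    """
    labels = defaultdict(list)
    for patient_id, label in train_rows:
        labels[patient_id].append(label)
    return {patient_id: mean(ys) for patient_id, ys in labels.items()}

def mse(predictions, truths):
    """Mean squared error between paired predictions and true labels."""
    return mean((p - t) ** 2 for p, t in zip(predictions, truths))

# Toy data (made up): two patients with self-reported severity labels
train = [("p1", 0), ("p1", 2), ("p2", 4), ("p2", 4)]
test = [("p1", 1), ("p2", 3)]

null = fit_null_model(train)
null_error = mse([null[pid] for pid, _ in test], [y for _, y in test])
```

A sensor-based model is only informative for a patient if it beats this per-patient constant on held-out recordings.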

Through this challenge, as well as the post-challenge community phase, we determined that sensor measurements from passive monitoring of Parkinson’s patients can be used to predict symptom severity for a subset of patients. Moreover, these models were also predictive of in-clinic physician assessments of severity. Patient predictability was generally not related to factors like sample size or reporting lag but was somewhat related to overall disease severity.

Personalized prediction of on-off medication state from wearable-derived time-series features

Wearables hold potential for rich monitoring of patient state, particularly in chronic conditions such as Parkinson's disease. However, clinically useful information is difficult to extract due to the high dimensionality and large amounts of noise inherent to real-world sensor data. In such data regimes, deep learning techniques can be susceptible to overfitting, and simpler techniques may actually be preferable. We developed a data pipeline to predict on-off states for Parkinson’s disease from wearable accelerometer data while minimizing overfitting. The input to our pipeline was raw sensor data consisting of triaxial acceleration time series signals measured from smartwatches. We combined individual sensor axes and removed gravitational acceleration from the combined signal. We then extracted time-series features from the processed signal and fit a random forest to predict on-off state for each patient. To expand the training set, we divided each full-length observation into 10-second segments. Our pipeline generated predictions for each segment and used the ensembled median value as the prediction for the observation. This pipeline significantly outperformed the null model, as well as deep learning approaches, in both an internal validation and a held-out test set. Our approach emphasized parsimony and interpretability without sacrificing model performance.
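
The data flow of such a pipeline can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: axes are combined as the Euclidean magnitude, gravity is approximated by the mean of that magnitude, the three features are placeholders, and `predict_segment` is a hypothetical callable standing in for a fitted per-patient random forest.

```python
import math
from statistics import mean, median, pstdev

def combine_axes(x, y, z):
    """Collapse three axes into one orientation-independent magnitude."""
    return [math.sqrt(a * a + b * b + c * c) for a, b, c in zip(x, y, z)]

def remove_gravity(signal):
    """Assumption: approximate gravity by the signal mean and subtract it."""
    g = mean(signal)
    return [v - g for v in signal]

def segment(signal, sample_rate_hz, seconds=10):
    """Split into non-overlapping fixed-length segments; drop a short tail."""
    n = int(sample_rate_hz * seconds)
    return [signal[i:i + n] for i in range(0, len(signal) - n + 1, n)]

def features(seg):
    """Placeholder time-series features (the abstract does not list them)."""
    return [mean(abs(v) for v in seg), pstdev(seg), max(seg) - min(seg)]

def predict_observation(x, y, z, sample_rate_hz, predict_segment):
    """Predict per 10-second segment, then take the median over segments."""
    signal = remove_gravity(combine_axes(x, y, z))
    segments = segment(signal, sample_rate_hz)
    if not segments:
        raise ValueError("observation is shorter than one segment")
    return median(predict_segment(features(s)) for s in segments)
```

Taking the median over segments makes the observation-level prediction robust to a few segments dominated by unrelated movement.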

A Registry of Open Community Challenges (ROCC) to Increase Ease of Discovery and Challenge Participation

In recent years, biomedical and benchmarking challenges have grown in number and popularity in the open community. However, there is currently no straightforward way to search for and query information about active and upcoming challenges in one place. Instead, one must search across many sources to find a challenge that is of interest or fits one's expertise, and may consequently miss key dates for participation. The goal of the Registry of Open Community Challenges (ROCC) is to increase the ease of discovery for these challenges by creating a portal that will standardize and highlight key features about a challenge. These captured metadata are based on a schema known as the “minimal information about a challenge” (MIAC), and examples include challenge questions, available data, timelines and rounds, funders and organizers, domains, scoring metrics, and type of submission accepted (traditional or model-to-data). Participants will be able to use ROCC to search for challenges in one of two ways: navigate through a web-based platform or call a set of RESTful APIs. ROCC can also be used by challenge organizers to upload information about upcoming challenges or to update details on an active challenge. The development of a prototype of ROCC is currently underway and will initially focus on 38 DREAM challenges (from 2013 to mid-2020) and 152 non-DREAM challenges, including CAGI, CAMDA, BioCreAtivE, CASP, and more. A long-term goal of the ROCC is to expand to more non-DREAM challenges, and to create a higher standard for how open-community challenges are annotated, which could then lead to higher discoverability and increased participation.
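
To make the idea of a standardized challenge record concrete, here is a minimal sketch. The field names are hypothetical and only loosely follow the metadata examples listed above; they are not the MIAC schema, and the query function is a local stand-in, not the ROCC API. The example records are invented.

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class ChallengeRecord:
    """Hypothetical, simplified challenge metadata record."""
    name: str
    question: str
    start: date
    end: date
    domains: list = field(default_factory=list)
    scoring_metrics: list = field(default_factory=list)
    submission_type: str = "traditional"  # or "model-to-data"

def find_open(records, on, domain=None):
    """Return challenges active on a given date, optionally in one domain."""
    return [r for r in records
            if r.start <= on <= r.end
            and (domain is None or domain in r.domains)]

# Invented records for illustration only
records = [
    ChallengeRecord("Example A", "Predict outcome X", date(2020, 1, 1),
                    date(2020, 6, 30), ["imaging"], ["RMSE"]),
    ChallengeRecord("Example B", "Predict outcome Y", date(2020, 5, 1),
                    date(2020, 12, 31), ["wearables"], ["AUROC"],
                    "model-to-data"),
]

open_in_may = find_open(records, date(2020, 5, 15))
open_wearables = find_open(records, date(2020, 5, 15), domain="wearables")
```

Once every challenge is described with the same fields, queries by date, domain, or submission type become simple filters.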

CTD2 Beat AML DREAM Challenge: Strategies for Prediction of Drug Efficacy and Patient Outcomes

In the era of precision medicine, acute myeloid leukemia (AML) patients have few therapeutic options: “7 + 3” induction chemotherapy has remained the standard for decades. While several agents targeting the myeloid marker CD33, alterations in FLT3 or IDH1/2, or the anti-apoptotic protein BCL2 have demonstrated efficacy in patients, responses are muted in some populations and relapse remains prevalent. There is an urgent need for targeted treatment options that are tailored to more refined patient subpopulations in order to achieve durable responses.

To address this need, we hosted an NCI-sponsored Beat AML DREAM Challenge under the auspices of the Cancer Target Discovery and Development (CTD2) program. In this community-wide assessment, participants predicted ex vivo sensitivity of AML patient primary cells to 122 targeted and chemotherapeutic agents using genomic, transcriptomic, and clinical data (sub-Challenge 1; SC1) and predicted clinical response using these data as well as the ex vivo drug sensitivity data (SC2). Data were furnished by the Beat AML initiative, which comprehensively profiled AML patient samples using whole-exome sequencing (WES), transcriptome sequencing (RNA-seq), and ex vivo functional drug sensitivity screens. Participants developed and tuned their methods using published training data (n=213 specimens) and subsequently received scored submissions on published “leaderboard” data (n=80). Final submissions were ranked on validation data (n=65) we generated for this Challenge using a primary scoring metric, with statistical ties resolved using a secondary metric.

Twenty-eight participants entered submissions for SC1. We applied two baseline comparator models: a ridge regression model using only expression data (primary metric Spearman’s rho = 0.32; secondary metric Pearson’s r = 0.32) and a Bayesian multitask multiple kernel learning method using expression and mutation data (rho = 0.31; r = 0.32), which was the top-performing method in a related assessment of drug sensitivity prediction across breast cancer cell lines in vitro. The top-performing participant improved upon both models (rho = 0.37; r = 0.38). Six of the top seven participants, including the first-ranked, used multitask approaches or otherwise shared information across the drugs. Fourteen participants entered submissions for SC2. A baseline Cox proportional hazards model with LASSO regularization using all available data modalities achieved a concordance index (CI; primary metric) of 0.68 and an AUC (secondary metric) of 0.65. Four participants were tied based on the primary metric, with the top participant determined by the secondary metric (CI = 0.77; AUC = 0.75).
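
The two SC1 metrics differ in what they reward, which a short sketch makes concrete. It scores one hypothetical drug's predictions against observed sensitivities; how per-drug scores were aggregated in the Challenge is not stated here, so no aggregation is shown. Spearman's rho is computed as Pearson's r on average ranks.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up values: predictions are monotone in the truth but not linear
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.0, 4.0, 9.0, 16.0, 25.0]

rho = spearman(observed, predicted)   # rewards correct ordering only
r = pearson(observed, predicted)      # also penalizes non-linearity
```

A rank-based primary metric rewards getting the ordering of specimens right, while the Pearson secondary metric additionally reflects how linear the relationship is.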

Omics-based prediction of preterm birth by Gaussian Process Regression models

We used a combination of support vector machine (SVM) and Gaussian process regression (GPR) models. The main task in the challenge was to tune the parameters of these two algorithms and to ensemble them. We included all samples in training, regardless of platform (microarray or RNA-seq), and quantile normalized each sample. Tuning the parameters of the SVM and GPR amounts to estimating how much noise there is in the expression data; we did this through a systematic grid search. The models were weighted equally when their predictions were combined.
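
Two of the steps above, quantile normalization and equal-weight combination, can be sketched without any modelling library. The sketch assumes the common form of quantile normalization in which each sample's values are replaced by the across-sample mean at the same rank; whether the authors used this reference distribution is not stated. Ties are broken by original order for simplicity, and the two prediction vectors are invented stand-ins for SVM and GPR outputs.

```python
def quantile_normalize(samples):
    """samples: list of equal-length lists, one per sample, genes aligned.

    Each value is replaced by the mean, across samples, of the values
    holding the same rank, so all samples share one distribution.
    """
    n_genes = len(samples[0])
    sorted_samples = [sorted(s) for s in samples]
    reference = [sum(s[rank] for s in sorted_samples) / len(samples)
                 for rank in range(n_genes)]
    normalized = []
    for s in samples:
        order = sorted(range(n_genes), key=lambda i: s[i])
        out = [0.0] * n_genes
        for rank, gene_index in enumerate(order):
            out[gene_index] = reference[rank]
        normalized.append(out)
    return normalized

def equal_weight_ensemble(*model_predictions):
    """Average the predictions of several models with equal weights."""
    return [sum(p) / len(p) for p in zip(*model_predictions)]

# Made-up expression values for two samples and three genes
normalized = quantile_normalize([[5, 2, 3], [4, 1, 6]])

# Made-up predictions standing in for the SVM and GPR outputs
combined = equal_weight_ensemble([30.0, 32.0], [34.0, 36.0])
```

Putting microarray and RNA-seq samples on a shared distribution is what allows them to be pooled into one training set.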

Flash talk BEAT-PD: use deep learning to predict tremor severity

The hallmark of digital medicine is the ability to monitor patients remotely without a physician. While accelerometer/gyroscope-based digital biomarkers have been developed to classify many diseases such as Parkinson’s, in general it remains an open question whether they can be used to monitor severity, particularly in a free-living environment. We report modalities and algorithms that combat the confounding factors in free-living environments and enable remote tremor severity monitoring for individual Parkinson’s patients. We identified fundamental reasons why previous attempts failed: as in existing studies, direct regression against severity scores produced no signal. We point to the critical aspects of constructing personalized parameters that allowed the model to rank at the top of the BEAT-PD End Point Challenge. We envision that the methodology will have direct applications in clinical trials and patient care that require objective, fine-grained scoring, and that it can be adapted within the digital biomarker field for many other neurological or movement conditions.

A multistage deep learning method for scoring radiographic hand and foot joint damage in rheumatoid arthritis

We'll talk about our entry to the RA2 DREAM Challenge, which won the overall damage prediction category (SC1) - for details see our writeup at https://www.synapse.org/#!Synapse:syn21478998/wiki/604432

The main difficulty in this competition was the lack of training data. We'll review the strategies we used to deal with this, including:
- Using a DL model to convert all images to the same dihedral orientation.
- Using a DL model to locate joints and cut out joint images - this enabled us to merge groups of joints into one model, multiplying the training data available per prediction.
- Thoughtful use of data augmentation, including perspective warps.
- Using a carefully chosen pretrained architecture and cross-validation for final damage prediction.
Used together, these strategies enabled us to use potentially higher-performance deep learning models without overfitting.
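
The orientation step can be illustrated with plain nested lists. A rectangular image has eight dihedral views (four rotations, each with or without a mirror). The sketch below assumes a classifier has already predicted which view index `k` an image is in; that classifier is not shown, and the index convention and function names are hypothetical rather than taken from our entry.

```python
def rot90(img):
    """Rotate a 2-D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def flip_h(img):
    """Mirror a 2-D grid left to right."""
    return [row[::-1] for row in img]

def apply_dihedral(img, k):
    """View k = 2 * r + f: rotate r times clockwise, then mirror if f == 1."""
    r, f = divmod(k, 2)
    out = [list(row) for row in img]
    for _ in range(r):
        out = rot90(out)
    return flip_h(out) if f else out

def to_canonical(img, predicted_k):
    """Undo view predicted_k: mirror back first, then finish the rotation."""
    r, f = divmod(predicted_k, 2)
    out = flip_h(img) if f else [list(row) for row in img]
    for _ in range((4 - r) % 4):
        out = rot90(out)
    return out

# Tiny stand-in for an image; all pixels distinct so views are distinguishable
image = [[1, 2, 3],
         [4, 5, 6]]
views = [apply_dihedral(image, k) for k in range(8)]
restored = [to_canonical(view, k) for k, view in enumerate(views)]
```

Because every view maps back to the same canonical image, downstream models never have to learn orientation invariance from scarce training data.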

Finally, we'll discuss what we think is an interesting open question: whether to use a postprocessing stage in which we adjust a patient's individual joint predictions based on the predictions from their other joints. Unlike the other winning entries, we didn't do this, because we felt unsure about whether this is a good thing to do in practice. We'll present some preliminary analysis of this question based on the competition training set.

Assessment of Parkinson's disease dyskinesia in a free-living environment

While there is inherent value in a clinician examination, the “gold standard” clinical assessment for Parkinson's disease (PD), the MDS-UPDRS, is subjective, administered sporadically, and may not reflect the full burden of disease severity. Thus, more frequently administered, objective measurements via digital technologies can allow for a more accurate detailing of one's disease severity and treatment response.

The BEAT-PD DREAM Challenge offered a dataset of smartwatch/smartphone accelerometer recordings from 16 patients with PD, who self-reported their level of dyskinesia (e.g. on a scale of 0-4) during each recording period.

We trained a random forest regression model to predict the level of dyskinesia based on measurements extracted from the accelerometer signals. Sixteen features were extracted from the accelerometer signals, such as the mean acceleration and dominant frequency of motion. In addition to the accelerometer features, patient characteristics (e.g. age, gender, and baseline UPDRS scores) were also used to train the model. This allows the model to develop branches personalized for certain (types of) patients. Personalization is important not only due to differing patient lifestyles and disease progression, but also because the labels for these data are patient-reported, i.e. they are subjective. The total set of features can be reduced by principal component analysis (PCA) or recursive feature elimination (RFE) without significant impact on accuracy.
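
Two of the named sensor features, and the way they can be joined with patient characteristics into one model input, are sketched below. This is an illustration only: the exact feature definitions in our code may differ, the dominant frequency is found with a naive discrete Fourier transform, and the dictionary keys `age` and `baseline_updrs` are hypothetical.

```python
import cmath
import math

def mean_acceleration(signal):
    """Mean of the acceleration signal over one window."""
    return sum(signal) / len(signal)

def dominant_frequency(signal, sample_rate_hz):
    """Frequency (Hz) of the strongest non-zero DFT bin.

    Naive O(n^2) transform, adequate for a short illustrative window.
    """
    n = len(signal)
    offset = sum(signal) / n
    centered = [v - offset for v in signal]
    best_k, best_power = 0, -1.0
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n)
                    for t, v in enumerate(centered))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return best_k * sample_rate_hz / n

def feature_row(signal, sample_rate_hz, patient):
    """Sensor features followed by patient characteristics."""
    return [mean_acceleration(signal),
            dominant_frequency(signal, sample_rate_hz),
            patient["age"],
            patient["baseline_updrs"]]

# Synthetic window: a 2 Hz oscillation sampled at 20 Hz for 2 seconds
window = [math.sin(2 * math.pi * 2.0 * t / 20) for t in range(40)]
row = feature_row(window, 20, {"age": 60, "baseline_updrs": 30})
```

Placing patient characteristics in the same row as the sensor features is what lets a tree-based model split on them and form patient-specific branches.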

The model makes a prediction for every 30 seconds of activity. For more stable predictions, these estimates can be averaged over longer periods of time (such as the 20-minute recordings of the DREAM challenge).
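
The averaging step can be sketched as follows, assuming a list of consecutive 30-second predictions. The function name is hypothetical; with 30-second windows, a 20-minute recording corresponds to a block of 40 predictions.

```python
from statistics import mean

def block_average(window_predictions, block_size):
    """Average consecutive predictions in blocks of block_size.

    A final, shorter block is averaged over the predictions it contains.
    """
    return [mean(window_predictions[i:i + block_size])
            for i in range(0, len(window_predictions), block_size)]

# Made-up per-window severity predictions
per_window = [0.0, 1.0, 2.0, 3.0, 4.0]
smoothed = block_average(per_window, 2)

# One value for a whole 20-minute recording of 40 windows
recording_estimate = block_average([1.0] * 40, 40)
```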

Our model predicted dyskinesia severity with a mean per-patient error of 0.4053. In validation, we found that the model performed well on less severe dyskinesias, but underestimated severity in the relatively rare cases of high severity (e.g. 4 out of 4 dyskinesia). Future improvements could be made by addressing this class imbalance. We would also like to incorporate time and date into the model to capture circadian patterns. The current version outperformed all 37 other teams in the BEAT-PD DREAM Challenge.

We found that the UPDRS scores were very important features for dyskinesia prediction. In many cases, sensor-derived features were secondary to the UPDRS values. Some of the most important sensor-derived features were the mean acceleration, power spectral entropy, and correlation coefficients between acceleration axes.

Our code is publicly available: https://bitbucket.org/atpage/beat-pd/.