RA2-DREAM Challenge: Automated Scoring of Radiographic Damage in Rheumatoid Arthritis

Rheumatoid arthritis (RA) is a common chronic autoimmune disease characterized by inflammation of the synovium, leading to joint space narrowing and bony erosions around the joints. The current state-of-the-art method for quantifying the degree of joint damage is human visual inspection of radiographic images by highly trained readers. This tedious, expensive, and non-scalable method is an impediment to research on factors associated with RA joint damage and its progression, and may delay appropriate treatment decisions by clinicians. We sought to develop automatic, rapid, and accurate methods to quantify the degree of joint damage in patients with RA using machine learning or deep learning through the community crowdsourced RA2-DREAM Challenge. We will describe the motivation for the Challenge, background related to the scoring of joint damage in RA, and the scored radiographic images from clinical studies that supported the Challenge. In addition, we will discuss each of the three sub-challenges: (1) predict overall RA damage from radiographic images of hands and feet; (2) predict joint space narrowing scores from radiographic images of hands and feet; and (3) predict joint erosion scores from radiographic images of hands and feet.

A framework for studying machine learning methods in healthcare: The First EHR DREAM Challenge

Implementation of machine learning-based methods in healthcare is of high interest and has the potential to positively impact patient care. However, the real-world accuracy and outcomes of these methods remain largely unknown, and their performance on different subpopulations of patients also remains unclear. To address these important questions, we hosted a community challenge to evaluate disparate methods that predict healthcare outcomes. We focused on the prediction of all-cause mortality as it is quantitative and clinically unambiguous. To overcome patient privacy concerns, we employed a Model-to-Data approach, allowing citizen scientists and researchers to train and evaluate machine learning models on electronic health records from the University of Washington medical system. We held the EHR DREAM Challenge: Patient Mortality from May 2019 to April 2020. We asked participants to predict 180-day mortality status from the last visit that each patient had in UW Medicine. In total, we had 354 registered participants, coalescing into 25 independent teams. The top-performing team achieved an area under the receiver operating characteristic curve of 0.947 (95% CI 0.942, 0.951) and an area under the precision-recall curve of 0.487 on all patients over a one-year observation of a large health system. In a follow-up phase of the challenge, we extracted the trained features from the best-performing methods and evaluated the generalizability of models across different patient populations, revealing that models differ in accuracy on subpopulations, such as those defined by race or gender, even when they are trained on the same data and have similar accuracy on the overall population. This is the broadest community challenge focused on the evaluation of state-of-the-art machine learning methods in healthcare performed to date, and it shows the importance of prospective evaluation and collaborative development of individual models.
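
As an illustration of the headline metric, the following is a minimal sketch of how an area under the ROC curve and a 95% confidence interval could be computed from predicted mortality risks. The abstract does not say how the interval was derived; the percentile bootstrap used here, and the names `auroc` and `bootstrap_ci`, are assumptions made for illustration only.

```python
import random


def auroc(labels, scores):
    """AUROC as the probability that a random positive case is scored
    above a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AUROC (an assumed method,
    not necessarily the one used in the Challenge)."""
    rng = random.Random(seed)
    indices = range(len(labels))
    stats = []
    while len(stats) < n_boot:
        sample = [rng.choice(indices) for _ in indices]
        ys = [labels[i] for i in sample]
        if len(set(ys)) < 2:
            continue  # resample contained a single class; AUROC undefined
        stats.append(auroc(ys, [scores[i] for i in sample]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


if __name__ == "__main__":
    y = [0, 0, 1, 1]
    risk = [0.1, 0.4, 0.35, 0.8]
    print(auroc(y, risk), bootstrap_ci(y, risk, n_boot=200))
```

The pairwise formulation is quadratic in the number of patients, so on a cohort the size of a health system a rank-based computation would be the more practical choice.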

Parkinson's disease symptom assessment in free-living conditions; the BEAT-PD Challenge

Recent advances in mobile health have demonstrated great potential to leverage sensor-based technologies for quantitative, remote monitoring of health and disease, particularly for diseases affecting motor function such as Parkinson’s disease. While infrequent doctor’s visits along with patient recall can be subject to bias, remote monitoring offers the promise of a more objective, holistic picture of the symptoms and complications experienced by patients on a daily basis, which is critical for making decisions about treatment.

Previous work, including the 2017 Parkinson’s Disease Digital Biomarker DREAM Challenge, showed that Parkinson’s diagnosis and symptom severity can be predicted using wearable and consumer sensors worn during the completion of specific short tasks. The BEAT-PD Challenge sought to understand whether symptom severity could be predicted from passive monitoring of patients, as they went about their daily lives, which is a critical component to developing algorithms for remote monitoring. To this end, we leveraged two previously unavailable data sets which collected passive accelerometer data from wrist-worn devices coupled with patient self-reports of symptom severity. Participants were asked to build patient-specific models to predict on/off medication status (subchallenge 1), dyskinesia, an often-violent involuntary movement which arises as a side-effect of medication (subchallenge 2), and tremor (subchallenge 3) for 28 patients. The participant models were compared to a patient-specific null model.
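
To make the comparison against a patient-specific null model concrete, here is a minimal sketch. It assumes the null model predicts each patient's mean training label and that error is measured by per-patient mean squared error; neither detail is stated in this abstract, and the function names are hypothetical.

```python
from statistics import mean


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))


def compare_to_null(train_labels, test_labels, model_preds):
    """For each patient, compare a model's MSE with that of a constant
    predictor equal to the patient's mean training label (assumed null)."""
    results = {}
    for patient, y_test in test_labels.items():
        null_value = mean(train_labels[patient])
        null_mse = mse(y_test, [null_value] * len(y_test))
        model_mse = mse(y_test, model_preds[patient])
        results[patient] = {
            "null_mse": null_mse,
            "model_mse": model_mse,
            "beats_null": model_mse < null_mse,
        }
    return results


if __name__ == "__main__":
    train = {"p1": [0, 2, 4], "p2": [1, 1, 1]}
    test = {"p1": [0, 4], "p2": [1, 3]}
    preds = {"p1": [1, 3], "p2": [1, 1]}
    for patient, row in compare_to_null(train, test, preds).items():
        print(patient, row)
```

Evaluating each patient separately is what allows a result of the form "predictable for a subset of patients": a model can beat the null for some individuals and not for others.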

Through this challenge, as well as the post-challenge community phase, we determined that sensor measurements from passive monitoring of Parkinson’s patients can be used to predict symptom severity for a subset of patients. Moreover, these models were also predictive for in-clinic physician-assessments of severity. Patient predictability was generally not related to factors like sample size or reporting lag but was somewhat related to overall disease severity.

Dealing with high dimensional data to predict preterm birth

The gene expression of human cells is a complex system with thousands of interacting components. In several studies, researchers have successfully used machine learning methods to predict high-level biological phenomena like preterm birth, as in the recent DREAM PTB challenge. Can we really get truly biologically meaningful insights with this approach?

Deep Learning-based Prediction of Radiographic Joint Damage in Rheumatoid Arthritis

Rheumatoid arthritis (RA) is an autoimmune disease affecting joints of the hands, feet, wrists, ankles, elbows, and knees. It is estimated that about 0.6 percent of adults in the United States are affected by joint damage associated with RA, including pain and swelling around the joints. A standard way to evaluate joint damage is to manually examine radiographic images of the joints and estimate the severity of joint space narrowing and erosion, a process that is labor-intensive and time-consuming even for experienced radiologists. Here we present a deep learning-based approach for automatically predicting joint damage and segmenting the regions of interest. This approach ranked top in the 2020 RA2 DREAM Challenge - Automated Scoring of Radiographic Joint Damage.

SVM-based approach to predict preterm birth using omics data

In the DREAM Preterm Birth Prediction Challenge, Transcriptomics (Sub-challenge 2), the goal was to predict the preterm birth phenotypes (sPTD and PPROM) with a minimal set (at most 100) of transcriptomic features. We (team IGIB) 1) performed differential expression analysis between sPTD vs. control and PPROM vs. control using a t-test or Wilcoxon test, 2) prioritized the top 100 features by p-value, 3) built SVM-based classification models (kernel types: linear, sigmoid, and radial) with 5-fold cross-validation, and 4) selected the best SVM approach, radial SVM, for prediction of the preterm birth phenotypes based on overall sensitivity and specificity across the 5-fold CV. Overall, the radial-SVM models predicted sPTD with 96.51% sensitivity and 96% specificity, and PPROM with 100% sensitivity and 100% specificity.
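
The feature-prioritization and evaluation steps can be sketched as follows. This is an illustrative simplification rather than the team's code: it ranks features by the absolute Welch t statistic instead of the p-value (the two orderings can differ when degrees of freedom vary across features), omits the SVM itself, and uses hypothetical function names.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples (each needs at least 2 values)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def top_features(X, y, k=100):
    """Rank feature columns by |t| between cases (label 1) and controls
    (label 0) and return the indices of the k highest-ranked features."""
    scored = []
    for j in range(len(X[0])):
        cases = [row[j] for row, label in zip(X, y) if label == 1]
        controls = [row[j] for row, label in zip(X, y) if label == 0]
        scored.append((abs(welch_t(cases, controls)), j))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [j for _, j in scored[:k]]


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


if __name__ == "__main__":
    X = [[1.0, 5.0, 0.0], [1.1, 5.2, 1.0], [0.9, 4.9, 0.0],
         [3.0, 5.1, 1.0], [3.2, 5.0, 0.0], [2.9, 5.3, 1.0]]
    y = [0, 0, 0, 1, 1, 1]
    print(top_features(X, y, k=1))
    print(sensitivity_specificity([1, 1, 0, 0], [1, 0, 0, 0]))
```

In a pipeline like this, feature selection is usually repeated inside each training fold, so that held-out samples do not influence which features are chosen and the cross-validated sensitivity and specificity are not optimistically biased.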

An Iterative Strategy Optimizing CDE Recommendations from Real-World Data

Annotating medical metadata, and metadata in general, is a tedious and error-prone task for humans. There are usually many usable machine-assisted methods to achieve the same goals, from simple rule-based systems to algorithms applying the latest and most sophisticated findings in the ML world. During the Metadata Annotation DREAM Challenge, teams attempted to mimic the ability of individual curators to choose common data elements (standardized and curated definitions of fields that can be used on clinical forms) that are appropriate for a given data set, containing given header labels and data values. The CEDAR Team developed an algorithm that aims for good results against the provided scoring algorithm while remaining relatively simple and quick to run. Our team chose this path so that our algorithm can be easily deployed in real-life systems, operated in real time, used to support human selection, and understood and maintained by its adopters. In this talk, we will describe our approach, its strengths and weaknesses, and why we felt it was a good solution for likely real-world applications involving these types of selection problems.

Crowdsourcing assessment of maternal blood multi-omics for predicting gestational age and preterm birth

Identification of pregnancies at risk of preterm birth (PTB), the leading cause of newborn deaths, remains challenging given the syndromic nature of the disease. We report a longitudinal multi-omics study coupled with a DREAM challenge to develop predictive models of PTB. We found that whole blood gene expression predicts ultrasound-based gestational ages in normal and complicated pregnancies (r=0.83), as well as the delivery date in normal pregnancies (r=0.86), with an accuracy comparable to ultrasound. However, unlike the latter, transcriptomic data collected at <37 weeks of gestation predicted the delivery date of one third of spontaneous preterm birth (sPTB) cases within 2 weeks of the actual date. Based on samples collected before 33 weeks in asymptomatic women, we found expression changes preceding preterm prelabor rupture of the membranes that were consistent across time points and cohorts, involving, among others, leukocyte-mediated immunity. Plasma proteomic random forests predicted sPTB with higher accuracy and earlier in pregnancy than whole blood transcriptomic models (e.g. AUROC=0.76 vs. AUROC=0.6 at 27-33 weeks of gestation).

Metadata Automation: A TF-IDF and Nearest Neighbors Approach

The goal of the Metadata Automation DREAM Challenge was to develop a tool to automate the annotation of metadata fields and values in structured biomedical data files with the best candidate Common Data Element (CDE) matches from the Cancer Data Standards Registry and Repository (caDSR). We implemented our model in Python 3.6 and treated the challenge as essentially a fuzzy matching problem. Our approach uses Scikit-Learn’s TfidfVectorizer class along with a custom n-gram function to vectorize the data. These term frequency-inverse document frequency (TF-IDF) vectors are passed to Scikit-Learn’s NearestNeighbors class, which returns the k nearest CDE neighbors and their associated distance scores for each column header in the biomedical data file. For the returned CDEs with enumerated values, the Levenshtein distances from the observed values in the data to the CDE’s permissible values are computed using Python’s FuzzyWuzzy library. We then use a decision tree approach based on the TF-IDF distance scores and the observed values’ average Levenshtein distance scores to select and rank the top three CDE matches from the set of nearest neighbors for each column header. In this final ranking step, we apply cutoff values to the distance scores to determine when to include ‘NOMATCH’ as one of the three results. Throughout the challenge we experimented with many aspects of the algorithm, including the n-gram function, the selection of caDSR fields to include in the TF-IDF vectorization, and the cutoff values. We arrived at the final version of our model by selecting the features and parameters that maximized the overall score across all the provided test datasets.
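
The matching idea can be sketched with the standard library alone. The code below is a simplified stand-in for the Scikit-Learn and FuzzyWuzzy components named above, not the team's implementation: the character trigram function, the cosine distance, the example CDE names, and all function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Character n-grams of the lower-cased, space-padded text."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def to_vector(counts, idf):
    """L2-normalised TF-IDF vector; n-grams unseen at fit time are ignored."""
    vec = {g: tf * idf[g] for g, tf in counts.items() if g in idf}
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {g: v / norm for g, v in vec.items()} if norm else {}


def fit_tfidf(docs, n=3):
    """Return smoothed IDF weights and one TF-IDF vector per document."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    df = Counter(g for c in counts for g in c)
    idf = {g: math.log((1 + len(docs)) / (1 + d)) + 1 for g, d in df.items()}
    return idf, [to_vector(c, idf) for c in counts]


def nearest(query, docs, idf, vectors, k=3, n=3):
    """The k documents closest to the query by cosine distance."""
    q = to_vector(Counter(char_ngrams(query, n)), idf)
    scored = []
    for name, vec in zip(docs, vectors):
        cosine = sum(w * vec.get(g, 0.0) for g, w in q.items())
        scored.append((1.0 - cosine, name))
    return sorted(scored)[:k]


def levenshtein(a, b):
    """Edit distance via the classic two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    cde_names = ["Patient Age", "Patient Gender", "Tumor Stage"]  # made-up examples
    idf, vectors = fit_tfidf(cde_names)
    print(nearest("patient_age", cde_names, idf, vectors, k=2))
    print(levenshtein("male", "Male"))
```

A cutoff on the returned distance, as described above, would then decide whether a candidate is kept or replaced by ‘NOMATCH’; the threshold itself would need tuning against scored data.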

Rheumatoid arthritis X-ray evaluation with deep learning

In this work we created a method for automated joint scoring in which we detect joints in the hands and feet and score them with an ensemble model, while taking into account joint damage across all limbs with a random forest model. We arrived at this design after experimenting with many approaches that failed to improve the score. As far as we are aware, there is no similar work in the literature, and we did not use any additional datasets to achieve these results.