RA2-DREAM Challenge: Automated Scoring of Radiographic Damage in Rheumatoid Arthritis

Rheumatoid arthritis (RA) is a common chronic autoimmune disease characterized by inflammation of the synovium leading to joint space narrowing and bony erosions around the joints. The current state-of-the-art method for quantifying the degree of joint damage is human visual inspection of radiographic images by highly trained readers. This tedious, expensive, and non-scalable method is an impediment to research on factors associated with RA joint damage and its progression, and may delay appropriate treatment decisions by clinicians. We sought to develop automatic, rapid, accurate methods to quantify the degree of joint damage in patients with RA using machine learning or deep learning through the community crowdsourced RA2-DREAM Challenge. We will describe the motivation for the Challenge, the background related to the scoring of joint damage in RA, and the scored radiographic images from clinical studies that supported the Challenge. In addition, each of the three sub-challenges will be discussed: (1) predict overall RA damage from radiographic images of hands and feet; (2) predict joint space narrowing scores from radiographic images of hands and feet; and (3) predict joint erosion scores from radiographic images of hands and feet.

A framework for studying machine learning methods in healthcare: The First EHR DREAM Challenge

Implementation of machine learning-based methods in healthcare is of high interest and has the potential to positively impact patient care. However, the real-world accuracy and outcomes of applying these methods remain largely unknown, and their performance on different subpopulations of patients also remains unclear. To address these important questions, we hosted a community challenge to evaluate disparate methods that predict healthcare outcomes. We focused on the prediction of all-cause mortality as it is quantitative and clinically unambiguous. To overcome patient privacy concerns, we employed a Model-to-Data approach, allowing citizen scientists and researchers to train and evaluate machine learning models on electronic health records from the University of Washington medical system. We held the EHR DREAM Challenge: Patient Mortality from May 2019 to April 2020. We asked participants to predict the 180-day mortality status from the last visit that each patient had in UW Medicine. In total, we had 354 registered participants, coalescing into 25 independent teams. The top performing team achieved an area under the receiver operating characteristic curve of 0.947 (95% CI 0.942, 0.951) and an area under the precision-recall curve of 0.487 on all patients over a one-year observation of a large health system. In a follow-up phase of the challenge, we extracted the trained features from the best performing methods and evaluated the generalizability of models across different patient populations, revealing that models differ in accuracy on subpopulations, such as race or gender, even when they are trained on the same data and have similar accuracy on the overall population. This is the broadest community challenge focused on the evaluation of state-of-the-art machine learning methods in healthcare performed to date and shows the importance of prospective evaluation and collaborative development of individual models.

Parkinson's disease symptom assessment in free-living conditions; the BEAT-PD Challenge

Recent advances in mobile health have demonstrated great potential to leverage sensor-based technologies for quantitative, remote monitoring of health and disease - particularly for diseases affecting motor function such as Parkinson’s disease. While infrequent doctor’s visits along with patient recall can be subject to bias, remote monitoring offers the promise of a more objective, holistic picture of the symptoms and complications experienced by patients on a daily basis, which is critical for making decisions about treatment.

Previous work, including the 2017 Parkinson’s Disease Digital Biomarker DREAM Challenge, showed that Parkinson’s diagnosis and symptom severity can be predicted using wearable and consumer sensors worn during the completion of specific short tasks. The BEAT-PD Challenge sought to understand whether symptom severity could be predicted from passive monitoring of patients, as they went about their daily lives, which is a critical component to developing algorithms for remote monitoring. To this end, we leveraged two previously unavailable data sets which collected passive accelerometer data from wrist-worn devices coupled with patient self-reports of symptom severity. Participants were asked to build patient-specific models to predict on/off medication status (subchallenge 1), dyskinesia, an often-violent involuntary movement which arises as a side-effect of medication (subchallenge 2), and tremor (subchallenge 3) for 28 patients. The participant models were compared to a patient-specific null model.

Through this challenge, as well as the post-challenge community phase, we determined that sensor measurements from passive monitoring of Parkinson’s patients can be used to predict symptom severity for a subset of patients. Moreover, these models were also predictive of in-clinic physician assessments of severity. Patient predictability was generally not related to factors like sample size or reporting lag but was somewhat related to overall disease severity.

An Iterative Strategy Optimizing CDE Recommendations from Real-World Data

Annotating medical metadata, and metadata in general, is a tedious and error-prone task for humans. Many machine-assisted methods can serve the same goal, from simple rule-based systems to algorithms applying the latest and most sophisticated findings in the ML world. During the Metadata Automation DREAM Challenge, teams attempted to mimic the ability of individual curators to choose common data elements (standardized and curated definitions of fields that can be used on clinical forms) that are appropriate for a given data set containing given header labels and data values. The CEDAR Team developed an algorithm that aims to score well under the provided scoring algorithm while remaining relatively simple and quick to run. Our team chose this path so that our algorithm can be easily deployed in real-life systems, operated in real time, used to support human selection, and understood and maintained by its adopters. In this talk, we will describe our approach, its strengths and weaknesses, and why we felt it was a good solution for likely real-world applications involving these types of selection problems.

Crowdsourcing assessment of maternal blood multi-omics for predicting gestational age and preterm birth

Identification of pregnancies at risk of preterm birth (PTB), the leading cause of newborn deaths, remains challenging given the syndromic nature of the disease. We report a longitudinal multi-omics study coupled with a DREAM challenge to develop predictive models of PTB. We found that whole blood gene expression predicts ultrasound-based gestational ages in normal and complicated pregnancies (r=0.83), as well as the delivery date in normal pregnancies (r=0.86), with an accuracy comparable to ultrasound. However, unlike ultrasound, transcriptomic data collected at <37 weeks of gestation predicted the delivery date of one third of spontaneous preterm birth (sPTB) cases within 2 weeks of the actual date. Based on samples collected before 33 weeks in asymptomatic women, we found expression changes preceding preterm prelabor rupture of the membranes that were consistent across time points and cohorts, involving, among others, leukocyte-mediated immunity. Plasma proteomic random forests predicted sPTB with higher accuracy and earlier in pregnancy than whole blood transcriptomic models (e.g. AUROC=0.76 vs. AUROC=0.6 at 27-33 weeks of gestation).

Metadata Automation: A TF-IDF and Nearest Neighbors Approach

The goal of the Metadata Automation DREAM Challenge was to develop a tool to automate the annotation of metadata fields and values in structured biomedical data files with the best candidate Common Data Element (CDE) matches from the Cancer Data Standards Registry and Repository (caDSR). We implemented our model in Python 3.6 and treated the challenge as essentially a fuzzy matching problem. Our approach uses Scikit-Learn’s TfidfVectorizer class along with a custom n-gram function to vectorize the data. These term frequency-inverse document frequency (TF-IDF) vectors are passed to Scikit-Learn’s NearestNeighbors class, which returns the k nearest CDE neighbors and their associated distance scores for each column header in the biomedical data file. For the returned CDEs with enumerated values, the Levenshtein distances from the observed values in the data to the CDE’s permissible values are computed using Python’s FuzzyWuzzy library. We then use a decision tree approach based on the TF-IDF distance scores and the observed values’ average Levenshtein distance scores to select and rank the top three CDE matches from the set of nearest neighbors for each column header. In this final ranking step, we apply cutoff values to the distance scores to determine when to include ‘NOMATCH’ as one of the three results. Throughout the challenge we experimented with many aspects of the algorithm, including the n-gram function, the selection of caDSR fields to include in the TF-IDF vectorization, and the cutoff values. We arrived at the final version of our model by selecting the features and parameters that maximized the overall score across all the provided test datasets.
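The matching step described above can be illustrated with a small, library-free sketch. It is not the team's code: it swaps Scikit-Learn's TfidfVectorizer and NearestNeighbors for a plain-Python character n-gram TF-IDF with cosine distance, and it replaces the decision tree with a single distance cutoff. The function names, the trigram size, the 0.8 cutoff, and the example CDE names are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split a lower-cased, space-padded string into overlapping character n-grams."""
    padded = f" {text.lower().strip()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def normalize(vector):
    """Scale a sparse vector (dict) to unit length; empty vectors are returned as-is."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    return {g: w / norm for g, w in vector.items()} if norm else vector


def fit_tfidf(documents, n=3):
    """Return (idf, vectors) for a list of strings, using a smoothed IDF."""
    counts_per_doc = [Counter(char_ngrams(doc, n)) for doc in documents]
    doc_freq = Counter(gram for counts in counts_per_doc for gram in counts)
    total = len(documents)
    idf = {g: math.log((1 + total) / (1 + df)) + 1 for g, df in doc_freq.items()}
    vectors = [normalize({g: tf * idf[g] for g, tf in counts.items()})
               for counts in counts_per_doc]
    return idf, vectors


def transform(text, idf, n=3):
    """Vectorize a query string, ignoring n-grams never seen during fitting."""
    counts = Counter(char_ngrams(text, n))
    return normalize({g: tf * idf[g] for g, tf in counts.items() if g in idf})


def cosine_distance(a, b):
    """Cosine distance between two unit-length sparse vectors."""
    return 1.0 - sum(w * b.get(g, 0.0) for g, w in a.items())


def nearest_cdes(header, cde_names, k=3, n=3):
    """Return the k (distance, name) pairs closest to a column header."""
    idf, vectors = fit_tfidf(cde_names, n)
    query = transform(header, idf, n)
    scored = [(cosine_distance(query, vec), name)
              for vec, name in zip(vectors, cde_names)]
    return sorted(scored)[:k]


def rank_with_nomatch(header, cde_names, cutoff=0.8, k=3):
    """Keep neighbours within the (hypothetical) cutoff; pad with 'NOMATCH' if any were dropped."""
    names = [name for dist, name in nearest_cdes(header, cde_names, k) if dist <= cutoff]
    if len(names) < k:
        names.append("NOMATCH")
    return names


if __name__ == "__main__":
    # Made-up CDE names, used only to exercise the functions.
    cdes = ["Patient Age", "Patient Gender", "Tumor Stage", "Primary Diagnosis"]
    print(rank_with_nomatch("patient_age", cdes))
```

Character n-grams make the match tolerant of underscores, casing, and small spelling differences between column headers and CDE names, which is the sense in which the task is a fuzzy matching problem. A fuller version would also compare observed values against each candidate's permissible values before ranking.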

Rheumatoid arthritis X-ray evaluation with deep learning

In this work, we created a method for automated joint scoring in which we detect joints in the hands and feet with high confidence and score them with an ensemble model, while a random forest model takes into account joint damage across all limbs. We arrived at this design after experimenting with many alternatives, a number of which failed to improve the score. As far as we are aware, there is no similar work in the literature, and we did not use any additional datasets to achieve these results.

CTD-squared Pancancer Drug Activity DREAM Challenge

The Columbia Cancer Target Discovery and Development (CTD2) Center has developed Pancancer Analysis of Chemical Entity Activity (PanACEA), a database of dose-response curves and drug-perturbed RNAseq profiles for 400 clinical oncology drugs. We used this resource to host the CTD2 Pancancer Drug Activity DREAM Challenge, a crowdsourced competition to develop and benchmark computational models for the prediction of drug polypharmacology using drug sensitivity and gene expression information. We provided dose-response and drug-perturbed RNAseq data on 32 kinase inhibitors and asked the community to use these data to predict target binding across 255 kinases. Top performing teams employed two distinct strategies: simple similarity analysis using many highly curated training datasets, or more advanced deep learning models trained on a single large data set. Detailed analyses of the best performing methods provide (1) a framework for using pharmacogenomic data to predict drug-target interactions, (2) reconciliation of different “drug-target” gold-standard definitions, and (3) insights into therapeutically actionable associations between kinase signalling and transcriptional networks.

Predicting drug targets by integration of drug sensitivity and gene signature data - the NETPHAR strategy

Misidentifying a drug’s mechanism of action is a common problem in drug discovery. Despite recent efforts to profile transcriptomic changes after drug treatment, it remains unknown whether such profiles can facilitate the prediction of drug targets. The CTD2 Pancancer Drug Activity DREAM Challenge provided dose-response data and drug-gene signatures for 32 kinase inhibitors and asked participants to predict the binding targets of these anonymous drugs.

We collected three types of data: 1) drug sensitivity data; 2) gene signature data; and 3) drug-target interaction data. We used DrugComb (http://drugcomb.fimm.fi), a crowdsourced database of comprehensive drug sensitivity data for combinatorial and monotherapy screenings. From it, we computed robust drug sensitivity metrics, including IC20 and the relative inhibition (RI) score, which is based on the area under the log10-scaled dose-response curve. Drug-target interactions were derived from DrugTargetCommons (http://drugtargetcommons.fimm.fi/), a crowdsourced database for manually curating drug-target bioactivity values from the literature. The final training dataset includes 116 drugs that have cell line sensitivity features (d = 2*11), consensus gene expression signatures (d = 973, provided by organizers), and drug target profiles (d = 1259).

To determine the best machine learning models for predicting drug targets, we considered two classes of methods: weighted averaging and regression. For weighted averaging, the prediction was made by multiplying the Pearson correlation matrix by the drug-target interaction matrix. For regression, we considered standard machine learning algorithms including ElasticNet, RandomForest and GBM, each trained on the n = 116 compounds in the training set and then tested on the n = 32 Challenge compounds. We found that regression methods produced less accurate results, probably due to overfitting. In contrast, our weighted averaging method, which directly uses Pearson correlation to transform the original predictor space into a drug similarity space, seemed to produce superior performance.
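As a rough illustration of the weighted averaging idea, the sketch below scores each target for a test drug as the sum of the training drugs' target profiles, weighted by the Pearson correlation between the test drug's features and each training drug's features. This is equivalent to multiplying a (test x training) correlation matrix by a (training x target) interaction matrix. The function names and the toy data are hypothetical, and the source does not state whether any normalization or thresholding was applied, so none is included here.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom else 0.0


def predict_targets(test_features, train_features, train_targets):
    """Correlation-weighted sum of training drug-target profiles for each test drug.

    test_features:  one feature vector per test drug
    train_features: one feature vector per training drug (same length as test vectors)
    train_targets:  one target profile per training drug (rows of the interaction matrix)
    """
    n_targets = len(train_targets[0])
    predictions = []
    for query in test_features:
        # One similarity weight per training drug.
        weights = [pearson(query, reference) for reference in train_features]
        # Row of the matrix product: weights (1 x n_train) times targets (n_train x n_targets).
        scores = [sum(w * row[t] for w, row in zip(weights, train_targets))
                  for t in range(n_targets)]
        predictions.append(scores)
    return predictions


if __name__ == "__main__":
    # Toy data: two training drugs with opposite feature profiles and distinct targets.
    train_x = [[1, 2, 3, 4], [4, 3, 2, 1]]
    train_t = [[1, 0], [0, 1]]
    print(predict_targets([[2, 4, 6, 8]], train_x, train_t))
```

Because the method has no fitted parameters beyond the similarity computation, it has little opportunity to overfit a training set of only 116 compounds, which is consistent with the explanation offered above for its advantage over regression.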

In conclusion, we believe the hypothesis holds true that drug targets can be inferred from drug responses and perturbational profiles, given a proper choice of data and model. Specifically, we found that RI and IC20 are robust estimates of drug response. Deeply curated quantitative pharmacological databases (i.e., DrugComb, DrugTargetCommons and L1000) pave the way for advanced pharmacological modelling, which may help identify the mechanisms of drugs with improved accuracy.

Personalized prediction of on-off medication state from wearable-derived time-series features

Wearables hold potential for rich monitoring of patient state, particularly in chronic conditions such as Parkinson's disease. However, clinically useful information is difficult to extract due to the high dimensionality and large amounts of noise inherent to real-world sensor data. In such data regimes, deep learning techniques can be susceptible to overfitting, and simpler techniques may actually be preferable. We developed a data pipeline to predict on-off states for Parkinson’s disease from wearable accelerometer data while minimizing overfitting. The input to our pipeline was raw sensor data consisting of triaxial acceleration time series signals measured from smartwatches. We combined individual sensor axes and removed gravitational acceleration from the combined signal. We then extracted time-series features from the processed signal and fit a random forest to predict on-off state for each patient. To expand the training set, we divided each full-length observation into 10-second segments. Our pipeline generated predictions for each segment and used the ensembled median value as the prediction for the observation. This pipeline significantly outperformed the null model, as well as deep learning approaches, in both an internal validation and a held-out test set. Our approach emphasized parsimony and interpretability without sacrificing model performance.
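The segment-and-aggregate structure of this pipeline can be sketched as follows. Several details are assumptions, since the abstract does not specify them: the axes are combined with the Euclidean norm, gravity is removed by subtracting the signal mean, the sampling rate defaults to 50 Hz, and only two toy features are computed. The per-segment model is passed in as a plain callable (`segment_model`, a hypothetical name) standing in for the fitted per-patient random forest.

```python
import math
from statistics import mean, median


def combine_axes(samples):
    """Collapse (x, y, z) samples into one signal via the Euclidean norm (assumed choice)."""
    return [math.sqrt(x * x + y * y + z * z) for x, y, z in samples]


def remove_gravity(signal):
    """Subtract the mean as a crude stand-in for gravity removal (assumption)."""
    offset = mean(signal)
    return [value - offset for value in signal]


def split_segments(signal, sampling_hz, seconds=10):
    """Cut the signal into non-overlapping windows, dropping a trailing partial window."""
    size = int(sampling_hz * seconds)
    return [signal[i:i + size] for i in range(0, len(signal) - size + 1, size)]


def segment_features(segment):
    """Two toy features (mean and standard deviation); the real feature set is not listed."""
    avg = mean(segment)
    spread = math.sqrt(mean((v - avg) ** 2 for v in segment))
    return [avg, spread]


def predict_observation(samples, segment_model, sampling_hz=50, seconds=10):
    """Predict once per segment, then report the median across segments."""
    signal = remove_gravity(combine_axes(samples))
    segments = split_segments(signal, sampling_hz, seconds)
    if not segments:
        raise ValueError("observation is shorter than one segment")
    return median(segment_model(segment_features(seg)) for seg in segments)


if __name__ == "__main__":
    # Synthetic 30-sample recording at 1 Hz: still, shaking, still.
    recording = [(0, 0, 1)] * 10 + [(0, 0, 0), (0, 0, 2)] * 5 + [(0, 0, 1)] * 10
    # Stand-in model: use the segment's standard deviation as the "prediction".
    print(predict_observation(recording, lambda feats: feats[1], sampling_hz=1))
```

Taking the median across segments means that a minority of noisy segments cannot dominate the observation-level prediction, which fits the stated goal of limiting overfitting on noisy sensor data.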