Time series Gene Network Analyses of Explosive Breaching in U.S. Military Warfighters
Background: Injuries from exposure to blast explosions rose dramatically during the Iraq and Afghanistan wars due to increase use of IEDs resulting in blast-related neurotrauma. To investigate the effect of blast on gene regulatory networks, we use timeseries gene expression data from military “breachers” exposed to controlled, low-level blast explosives during training.
Methods: Blood samples were collected from male participants (age 30.2 ± 7.4 years) during a 3-day breacher training: baseline (day1), pre- and-post- breaching (day2), and follow-up (day3) at a U.S. Army training site. Blood samples were obtained and RNA-seq was performed for every time point. RNA-seq read counts were transformed to logCPM by fitting a linear model with the variable of interest (i.e., pre-post breaching operations) and adjusted for subject effect using limma. Potential batch effects was eliminated using surrogate variable analysis. To investigate whether exposure to explosive blast affects gene expression in a coordinated fashion, network analysis was performed via multiscale embedded gene co-expression network analysis (MEGENA) using variably expressed genes (sd >= 0.25) across all subjects/ time. Principal component (PC) analysis was performed on each network using a linear model and fitted on the 1st PC across timepoints for pre-post and pre-follow-up comparisons separately.
Results & Conclusions. To investigate whether prior exposure load including history of traumatic brain injury (TBI) or cumulative career breaching in military service impacts responsivity to an explosion acutely, interaction models including prior exposure load by time (pre-post breaching) analyses were performed. Compared to those with no history of TBI, those with prior TBI history showed blunted acute gene expression responsivity (i.e., no significant alterations) following exposure to blast explosives. Network analyses identified 5 unnested-networks with significant interaction between TBI/Breaching history and time (pre-post/follow-up, p<0.05) that included genes involved in inflammation and adaptive immunity. Semaphorin 7A (a hub gene within one of these networks) is a membrane associated protein that plays an important role in mediating the connection between the CNS and the immune system. Emerging data from both experimental and human studies have shown that a number of Sema loci are involved in response to CNS injury. Although SEMA7A has not been previously implicated in CNS injury, it shows increased expression following blast sub-acutely in subjects with prior history of TBI. These findings show the power of network approaches to identify transcriptional alterations directly in response to exposure to explosives, which is translationally significant in furthering our understanding of blast-related neurotrauma.
A new graph-based clustering method with application to single-cell RNA-seq data from human pancreatic islets
Traditional bulk RNA-sequencing of human pancreatic islets mainly reflects transcriptional response of major cell types. Single-cell RNA sequencing technology enables transcriptional characterization of individual cells, and thus makes it possible to detect cell types and subtypes. To tackle the heterogeneity of single-cell RNA-seq data, powerful and appropriate clustering is required to facilitate the discovery of cell types. In this paper, we propose a new clustering framework based on a graph-based model with various types of distances. We take the compositional nature of single-cell RNA-seq data into account and employ log-ratio transformations. The practical merit of the proposed method is demonstrated through the application to the centered log-ratio transformed single-cell RNA-seq data for human pancreatic islets. The practical merit is also demonstrated through comparisons with existing single-cell clustering methods.
A Novel Deep-Learning Based Approach to Identify, Localize, and Grade Malignant Prostate Lesions
Prostate cancer (PCa) is one of the most widespread and deadly cancers among American men. Currently, patients who are suspected of having PCa are recommended to undergo a biopsy, a mpMRI scan, or a MRI scan. These scans are often not reliable and can have a low detection rate. Furthermore, pathologists almost always have trouble grading the aggressiveness of the prostate tumor. These tumors are classified by the Gleason Grade scale or the ISUP Group Grade scale, which ranks the aggressiveness of the tumor on a scale from 1 to 5 where higher numbers represent a more dangerous tumor. To help identify, segment, and grade prostate tumors, recent studies point to the use of machine learning or artificial intelligence. We present a novel method to both grade and localize prostate tumors in multiparametric magnetic resonance images (mpMRI). Using a convolutional neural network (CNN) to identify mpMRI images that contained the tumors, a Faster region-based convolutional neural network (Faster RCNN) to segment the tumors, and a CNN to grade the tumors by differentiating tumors into high-grade and low-grade tumors, we present a novel method to localize and grade the tumors that performs at least as well as current machine learning techniques while using faster networks and requiring less data. Specifically, we used deep learning to segment the tumors, whereas previous studies had used machine learning, which is more computationally expensive than our method. Our CNN that identified whether mpMRI images had tumors or not had an accuracy of 98.7%. The Faster RCNN used had a sensitivity of 0.972, a false positive rate of 0.263, and an area under curve (AUC) of 0.9007. The CNN used to grade the tumors had an accuracy of 97.6%, which is similar to other machine learning models. In conclusion, these three models provide a novel approach to classifying and grading prostate tumors while requiring less data.
A mathematical model of the molecular mechanism of PZA in Mycobacterium tuberculosis
Pyrazinamide (PZA) is one of the most important drugs used in first and second-line treatments against tuberculosis to eliminate bacteria in latent state, reducing relapse incidence. Pyrazinoic acid (POA) is the active form of PZA, generated by hydrolysis of PZA mediated by the enzyme pyrazinamidase (PZase), encoded by the gene pncA. Its antibiotic effect requires an acidic environment, discovered by the high in vivo and poor in vitro sterilizing activity, potentially due the acidic environment experienced by M. tuberculosis inside granulomes.
The molecular targets of this drug are still unknown, and are thought to be more than one. The current mechanism, based on POA accumulation, states that PZA enters the cell through passive diffusion, is hydrolyzed to POA (mediated by PZase), and expelled to the extracellular environment by efflux pumps. Outside, expelled POA is protonated to HPOA in the acidic environment, and re-enters the bacteria by a membrane potential gradient. In this way, the external acidic environment helps to recover and accumulate intracellular POA, setting up a cycle that leads to a slight acidification of the cytoplasm, which together with the intracellular accumulation of POA act on several potential targets and metabolic pathways. However, cytoplasmic acidification is not strictly required. Previous studies have shown that overexpression of pncA or efflux pump inhibitors eliminate the requirement of an acidic environment for susceptibility in vitro.
To better understand the contribution of an acidic extracellular environment to the internal accumulation of POA, and to study alternative scenarios of resistance/susceptibility, we modelled the PZA/POA metabolic pathway using a system of nonlinear differential equations, fitting experimental parameters. Our results indicate that, in equilibrium (laboratory) conditions, POA accumulation is independent of external pH and only depends on the ratio between the rates of POA production and efflux. In addition, the acidic environment has a significant contribution in the internal accumulation of total-POA (defined as POA+HPOA) only when the ratio efflux-rate/diffusion of external POA and HPOA is greater than 1. Interestingly, the rate of POA production could help to increase the total-POA accumulation independently of the POA efflux rate and external pH. In addition, although low POA efflux rates accumulate more internal total-POA than an acidic environment, the cytoplasm will not be acidified, suggesting that a reduction of internal pH is not sufficient to produce the bactericidal effect. Finally, we performed simulations to evaluate possible consequences in growth rate under different PZA concentrations.
scFEA: A graph neural network model to estimate cell-wise metabolic using single cell RNA-seq data
The metabolic heterogeneity, and metabolic interplay between cells and their microenvironment have been known as significant contributors to disease treatment resistance. Our understanding of the intra-tissue metabolic heterogeneity and cooperation phenomena among cell populations is unfortunately quite limited, without a mature single cell metabolomics technology. To mitigate this knowledge gap, we developed a novel computational method, namely scFEA (single cell Flux Estimation Analysis), to infer single cell fluxome from single cell RNA-sequencing (scRNA-seq) data. scFEA is empowered by a comprehensively reorganized human metabolic map as focused metabolic modules, a novel probabilistic model to leverage the flux balance constraints on scRNA-seq data, and a novel graph neural network based optimization solver. The intricate information cascade from transcriptome to metabolome was captured using multi-layer neural networks to fully capitulate the non-linear dependency between enzymatic gene expressions and reaction rates. We experimentally validated scFEA by generating an scRNA-seq dataset with matched metabolomics data on cells of perturbed oxygen and genetic conditions. Application of scFEA on this dataset demonstrated the consistency between predicted flux and metabolic imbalance with the observed variation of metabolites in the matched metabolomics data. We also applied scFEA on publicly available single cell melanoma and head and neck cancer datasets, and discovered different metabolic landscapes between cancer and stromal cells. The cell-wise fluxome predicted by scFEA empowers a series of downstream analysis including identification of metabolic modules or cell groups that share common metabolic variations, sensitivity evaluation of enzymes with regards to their impact on the whole metabolic flux, and inference of cell-tissue and cell-cell metabolic communications.
A molecular taxonomy of tumors independent of tissue-of-origin
Cancer is an organism-level disease, impacting processes from cellular metabolism and the microenvironment to systemic immune response. Nevertheless, efforts to distinguish overarching mutational processes from interactions with the cell of origin for a tumor have seen limited success, presenting a barrier to individualized medicine. Here we present a novel, pathway-centric approach, extracting somatic mutational profiles within and between tissues, largely orthogonal to cell of origin, mutational burden, or stage. Known predisposition variants are equally distributed among clusters, and largely independent of molecular subtype. Prognosis and risk of death vary jointly by cancer type and cluster. Analysis of metastatic tumors reveals that differences are largely cluster-specific and complementary, implicating convergent mechanisms that combine familiar driver genes with diverse low-frequency lesions in tumor-promoting pathways, ultimately producing distinct molecular phenotypes. The results shed new light on the interplay between organism-level dysfunction and tissue-specific lesions.
Continuous chromatin state feature annotation of the human epigenome
Sequencing-based genomics assays can measure many types of genomic biochemical activity, including transcription factor binding, chromatin accessibility, transcription, and histone modifications. Data from sequencing-based genomics assays is now available from hundreds of human cellular conditions, including varying tissues, individuals, disease states, and drug perturbations. Semi-automated genome annotation (SAGA) methods are widely used to understand genome activity and gene regulation. These algorithms take as input a collection of sequencing-based genomics data sets from a particular tissue. They output an annotation of the genome that assigns a label to each genomic position.
All existing SAGA methods output a discrete annotation that assigns a single label to each position. This discrete annotation strategy has several limitations. First, discrete annotations cannot represent the strength of genomic elements. Variation among genomic elements in intensity or frequency of activity of cells in the sample is captured in variation in the intensity of the associated marks. Such variation is lost if all such elements are assigned the same label. Second, a discrete annotation cannot represent combinatorial elements that simultaneously exhibit multiple types of activity. To model combinatorial activity, a discrete annotation must use a separate label to represent each pair (or triplet etc.) of activity types.
We propose a method that uses a Kalman filter state space model to efficiently annotate the genome with chromatin state features. That is, our method outputs a vector of real-valued chromatin state features for each genomic position, where each chromatin state feature putatively represents a different type of activity. Continuous chromatin state features have a number of benefits over discrete labels. First, these features preserve the underlying continuous nature of the input signal tracks. Second, in contrast to discrete labels, continuous features can easily capture the strength of a given element. Third, continuous features can easily handle positions with combinatorial activity by assigning a high weight to multiple features. Fourth, chromatin state features lend themselves to expressive visualizations because they project complex data sets onto a small number of dimensions.
We also propose several measures of the quality of a chromatin state feature annotation relative to genes and gene expression, and we compare epigenome-ssm to existing SAGA methods (ChromHMM, Segway) according to these quality measures. We found that epigenome-ssm produces the highest-quality annotations of the methods we compared. Particularly, an epigenome-ssm annotation is a better predictor of gene expression and enhancer activity.
Learning a genome-wide score of human-mouse conservation at the functional genomics level
Identifying genomic regions with functional genomic properties that are conserved between human and mouse is an important challenge in the context of mouse model studies. To address this, we develop a computational method, Learning Evidence of Conservation from Integrated Functional genomic annotations (LECIF). LECIF integrates thousands of human and mouse functional genomic annotations to learn a score of evidence for conservation at the functional genomics level. While LECIF leverages data from diverse sources, it does not require explicit matching of experiments across species by biological origin or type.
To do so, LECIF trains an ensemble of neural networks that take human and mouse regions aligning at the sequence level as positive pairs and randomly matched human and mouse regions that do not align to each other but somewhere else in the other species as negative pairs. Each human or mouse region is characterized by a species-specific feature vector with thousands of values corresponding to ChromHMM chromatin state annotations, signals from RNA-seq experiments, and peak calls from DNase-seq, transcription factor ChIP-seq, histone mark ChIP-seq, and CAGE experiments. We used the trained classifier used to provide a LECIF score for all aligning regions.
The LECIF score is highly predictive of pairs of regions that align at the sequence level, even when controlling for properties of regions that align in general. The score captures correspondence of biologically similar annotations between human and mouse, even though LECIF was not explicitly given such information. While the LECIF score is moderately correlated with sequence constraint scores, it captures distinct
information on conserved properties. The LECIF score is higher in regions previously shown to have similar phenotypic properties in human and mouse at the genetic and epigenetic level. We expect the human-mouse LECIF score will be an important a resource for studies using mouse as a model organism.
Single cell RNAseq and machine learning reveal novel subpopulations and regulatory networks in low-grade inflammatory monocytes
Monocyte is a key innate immune cell type modulating diverse host inflammatory responses. Subclinical doses of LPS (SD-LPS) are known to causes low-grade inflammation in monocytes, which could lead to inflammatory diseases including atherosclerosis and metabolic syndrome. Sodium 4-phenylbutyrate (4-PBA) is a potential therapeutic compound which can reduce the inflammation caused by SD-LPS. In this study, we aim to understand the gene regulatory networks of monocyte under low-grade inflammation and the mechanism of action for 4-PBA by integrating single cell RNA sequencing (scRNAseq), transcription factor binding motifs and ATAC-seq data using machine learning. We have generated scRNAseq data from mouse monocytes treated with PBS, SD-LPS, 4-PBA, and SD-LPS + 4-PBA and identified 11 clusters in the single-cell RNA-seq data from these four conditions. A machine learning method, based on guided, regularized random forest (GRRF) and feature selection was developed to select best candidate TFs that are involved in this immune response. Our method achieved high auROC, auPRC, and F1 scores in testing dataset and outperformed traditional enrichment-based methods. In particular, among 531 candidate TFs, our method achieves an auROC of 0.961 with only 10 motifs for one of the 11 clusters. Our method is particularly efficient in selecting a few candidate genes to explain observed expression pattern. For example, our GRRF method achieved auROC=0.90 using only three TFs whereas enrichment method required seven TFs to achieve the same performance. Finally, we found two novel subpopulations of monocyte cells in response to LD-LPS and we confirmed our analysis using independent flow cytometry experiments. Our results suggest our new machine learning method can select candidate regulatory genes as potential targets for developing new therapeutics against low grade inflammation.
Predicting genome-wide DNA replication timing from sequence features
DNA replication in eukaryotic cells duplicates the genome during cell division with a highly regulated temporal order. Proper replication timing (RT) control is of vital importance to maintain the composition of the epigenome (such as 3D genome organization) and gene transcription. However, our understanding of the genomic sequence determinants regulating DNA replication timing remains surprisingly limited. A major algorithmic challenge is to delineate a series of potential sequence determinants in shaping the RT programs over large-scale sequence domains.
Here we develop a new method, named CONCERT, to simultaneously predict RT from sequence features and identify genomic sequence elements that modulate RT in a genome-wide manner. CONCERT integrates two functionally cooperative modules, a selector and a predictor, which are trained jointly. The selector module performs importance estimation-based subset sampling of the genomic sequences to detect predictive elements. Utilizing sequence importance estimation from the selector, the predictor module incorporates the bi-directional recurrent neural networks and the self-attention mechanism to achieve selective learning of long-range spatial dependencies across genomic loci and context-aware feature representation learning of genomic sequences. The model also employs a hierarchical structure for capturing genomic context information at different scales.
We applied CONCERT to predict RT in mouse embryonic stem cells (mESCs) and human cell types. Using only the genomic sequences as input, CONCERT reaches above 0.90 Pearson correlation coefficients (PCCs) for RT prediction in mESCs and 0.80-0.88 PCCs in seven human cell types. Importantly, the identified predictive genomic elements exhibit strong correlations with specific types of genomic features including repetitive elements and cis-regulatory elements, revealing sequence properties that may regulate RT. In particular, each of the five early replication control elements (ERCEs) in mESCs that were experimentally validated through CRISPR-mediated deletions in a recent study (Sima et al., Cell 2019) can be reliably identified by our method. Furthermore, by applying to multiple human cell types, CONCERT delineated conserved and cell type-specific sequence elements that may play key roles in RT regulation.
Taken together, CONCERT provides a generic interpretable machine learning framework for predicting large-scale genomic profiles based on sequence features and provides new insights into the potential sequence determinants of the DNA replication timing program.