Time series Gene Network Analyses of Explosive Breaching in U.S. Military Warfighters

Background: Injuries from exposure to blast explosions rose dramatically during the Iraq and Afghanistan wars due to increase use of IEDs resulting in blast-related neurotrauma. To investigate the effect of blast on gene regulatory networks, we use timeseries gene expression data from military “breachers” exposed to controlled, low-level blast explosives during training.
Methods: Blood samples were collected from male participants (age 30.2 ± 7.4 years) during a 3-day breacher training: baseline (day1), pre- and-post- breaching (day2), and follow-up (day3) at a U.S. Army training site. Blood samples were obtained and RNA-seq was performed for every time point. RNA-seq read counts were transformed to logCPM by fitting a linear model with the variable of interest (i.e., pre-post breaching operations) and adjusted for subject effect using limma. Potential batch effects was eliminated using surrogate variable analysis. To investigate whether exposure to explosive blast affects gene expression in a coordinated fashion, network analysis was performed via multiscale embedded gene co-expression network analysis (MEGENA) using variably expressed genes (sd >= 0.25) across all subjects/ time. Principal component (PC) analysis was performed on each network using a linear model and fitted on the 1st PC across timepoints for pre-post and pre-follow-up comparisons separately.
Results & Conclusions. To investigate whether prior exposure load including history of traumatic brain injury (TBI) or cumulative career breaching in military service impacts responsivity to an explosion acutely, interaction models including prior exposure load by time (pre-post breaching) analyses were performed. Compared to those with no history of TBI, those with prior TBI history showed blunted acute gene expression responsivity (i.e., no significant alterations) following exposure to blast explosives. Network analyses identified 5 unnested-networks with significant interaction between TBI/Breaching history and time (pre-post/follow-up, p<0.05) that included genes involved in inflammation and adaptive immunity. Semaphorin 7A (a hub gene within one of these networks) is a membrane associated protein that plays an important role in mediating the connection between the CNS and the immune system. Emerging data from both experimental and human studies have shown that a number of Sema loci are involved in response to CNS injury. Although SEMA7A has not been previously implicated in CNS injury, it shows increased expression following blast sub-acutely in subjects with prior history of TBI. These findings show the power of network approaches to identify transcriptional alterations directly in response to exposure to explosives, which is translationally significant in furthering our understanding of blast-related neurotrauma.

A new graph-based clustering method with application to single-cell RNA-seq data from human pancreatic islets

Traditional bulk RNA-sequencing of human pancreatic islets mainly reflects transcriptional response of major cell types. Single-cell RNA sequencing technology enables transcriptional characterization of individual cells, and thus makes it possible to detect cell types and subtypes. To tackle the heterogeneity of single-cell RNA-seq data, powerful and appropriate clustering is required to facilitate the discovery of cell types. In this paper, we propose a new clustering framework based on a graph-based model with various types of distances. We take the compositional nature of single-cell RNA-seq data into account and employ log-ratio transformations. The practical merit of the proposed method is demonstrated through the application to the centered log-ratio transformed single-cell RNA-seq data for human pancreatic islets. The practical merit is also demonstrated through comparisons with existing single-cell clustering methods.

A Novel Deep-Learning Based Approach to Identify, Localize, and Grade Malignant Prostate Lesions

Prostate cancer (PCa) is one of the most widespread and deadly cancers among American men. Currently, patients who are suspected of having PCa are recommended to undergo a biopsy, a mpMRI scan, or a MRI scan. These scans are often not reliable and can have a low detection rate. Furthermore, pathologists almost always have trouble grading the aggressiveness of the prostate tumor. These tumors are classified by the Gleason Grade scale or the ISUP Group Grade scale, which ranks the aggressiveness of the tumor on a scale from 1 to 5 where higher numbers represent a more dangerous tumor. To help identify, segment, and grade prostate tumors, recent studies point to the use of machine learning or artificial intelligence. We present a novel method to both grade and localize prostate tumors in multiparametric magnetic resonance images (mpMRI). Using a convolutional neural network (CNN) to identify mpMRI images that contained the tumors, a Faster region-based convolutional neural network (Faster RCNN) to segment the tumors, and a CNN to grade the tumors by differentiating tumors into high-grade and low-grade tumors, we present a novel method to localize and grade the tumors that performs at least as well as current machine learning techniques while using faster networks and requiring less data. Specifically, we used deep learning to segment the tumors, whereas previous studies had used machine learning, which is more computationally expensive than our method. Our CNN that identified whether mpMRI images had tumors or not had an accuracy of 98.7%. The Faster RCNN used had a sensitivity of 0.972, a false positive rate of 0.263, and an area under curve (AUC) of 0.9007. The CNN used to grade the tumors had an accuracy of 97.6%, which is similar to other machine learning models. In conclusion, these three models provide a novel approach to classifying and grading prostate tumors while requiring less data.

A mathematical model of the molecular mechanism of PZA in Mycobacterium tuberculosis

Pyrazinamide (PZA) is one of the most important drugs used in first and second-line treatments against tuberculosis to eliminate bacteria in latent state, reducing relapse incidence. Pyrazinoic acid (POA) is the active form of PZA, generated by hydrolysis of PZA mediated by the enzyme pyrazinamidase (PZase), encoded by the gene pncA. Its antibiotic effect requires an acidic environment, discovered by the high in vivo and poor in vitro sterilizing activity, potentially due the acidic environment experienced by M. tuberculosis inside granulomes.
The molecular targets of this drug are still unknown, and are thought to be more than one. The current mechanism, based on POA accumulation, states that PZA enters the cell through passive diffusion, is hydrolyzed to POA (mediated by PZase), and expelled to the extracellular environment by efflux pumps. Outside, expelled POA is protonated to HPOA in the acidic environment, and re-enters the bacteria by a membrane potential gradient. In this way, the external acidic environment helps to recover and accumulate intracellular POA, setting up a cycle that leads to a slight acidification of the cytoplasm, which together with the intracellular accumulation of POA act on several potential targets and metabolic pathways. However, cytoplasmic acidification is not strictly required. Previous studies have shown that overexpression of pncA or efflux pump inhibitors eliminate the requirement of an acidic environment for susceptibility in vitro.
To better understand the contribution of an acidic extracellular environment to the internal accumulation of POA, and to study alternative scenarios of resistance/susceptibility, we modelled the PZA/POA metabolic pathway using a system of nonlinear differential equations, fitting experimental parameters. Our results indicate that, in equilibrium (laboratory) conditions, POA accumulation is independent of external pH and only depends on the ratio between the rates of POA production and efflux. In addition, the acidic environment has a significant contribution in the internal accumulation of total-POA (defined as POA+HPOA) only when the ratio efflux-rate/diffusion of external POA and HPOA is greater than 1. Interestingly, the rate of POA production could help to increase the total-POA accumulation independently of the POA efflux rate and external pH. In addition, although low POA efflux rates accumulate more internal total-POA than an acidic environment, the cytoplasm will not be acidified, suggesting that a reduction of internal pH is not sufficient to produce the bactericidal effect. Finally, we performed simulations to evaluate possible consequences in growth rate under different PZA concentrations.

scFEA: A graph neural network model to estimate cell-wise metabolic using single cell RNA-seq data

The metabolic heterogeneity, and metabolic interplay between cells and their microenvironment have been known as significant contributors to disease treatment resistance. Our understanding of the intra-tissue metabolic heterogeneity and cooperation phenomena among cell populations is unfortunately quite limited, without a mature single cell metabolomics technology. To mitigate this knowledge gap, we developed a novel computational method, namely scFEA (single cell Flux Estimation Analysis), to infer single cell fluxome from single cell RNA-sequencing (scRNA-seq) data. scFEA is empowered by a comprehensively reorganized human metabolic map as focused metabolic modules, a novel probabilistic model to leverage the flux balance constraints on scRNA-seq data, and a novel graph neural network based optimization solver. The intricate information cascade from transcriptome to metabolome was captured using multi-layer neural networks to fully capitulate the non-linear dependency between enzymatic gene expressions and reaction rates. We experimentally validated scFEA by generating an scRNA-seq dataset with matched metabolomics data on cells of perturbed oxygen and genetic conditions. Application of scFEA on this dataset demonstrated the consistency between predicted flux and metabolic imbalance with the observed variation of metabolites in the matched metabolomics data. We also applied scFEA on publicly available single cell melanoma and head and neck cancer datasets, and discovered different metabolic landscapes between cancer and stromal cells. The cell-wise fluxome predicted by scFEA empowers a series of downstream analysis including identification of metabolic modules or cell groups that share common metabolic variations, sensitivity evaluation of enzymes with regards to their impact on the whole metabolic flux, and inference of cell-tissue and cell-cell metabolic communications.

A molecular taxonomy of tumors independent of tissue-of-origin

Cancer is an organism-level disease, impacting processes from cellular metabolism and the microenvironment to systemic immune response. Nevertheless, efforts to distinguish overarching mutational processes from interactions with the cell of origin for a tumor have seen limited success, presenting a barrier to individualized medicine. Here we present a novel, pathway-centric approach, extracting somatic mutational profiles within and between tissues, largely orthogonal to cell of origin, mutational burden, or stage. Known predisposition variants are equally distributed among clusters, and largely independent of molecular subtype. Prognosis and risk of death vary jointly by cancer type and cluster. Analysis of metastatic tumors reveals that differences are largely cluster-specific and complementary, implicating convergent mechanisms that combine familiar driver genes with diverse low-frequency lesions in tumor-promoting pathways, ultimately producing distinct molecular phenotypes. The results shed new light on the interplay between organism-level dysfunction and tissue-specific lesions.

Predicting genome-wide DNA replication timing from sequence features

DNA replication in eukaryotic cells duplicates the genome during cell division with a highly regulated temporal order. Proper replication timing (RT) control is of vital importance to maintain the composition of the epigenome (such as 3D genome organization) and gene transcription. However, our understanding of the genomic sequence determinants regulating DNA replication timing remains surprisingly limited. A major algorithmic challenge is to delineate a series of potential sequence determinants in shaping the RT programs over large-scale sequence domains.

Here we develop a new method, named CONCERT, to simultaneously predict RT from sequence features and identify genomic sequence elements that modulate RT in a genome-wide manner. CONCERT integrates two functionally cooperative modules, a selector and a predictor, which are trained jointly. The selector module performs importance estimation-based subset sampling of the genomic sequences to detect predictive elements. Utilizing sequence importance estimation from the selector, the predictor module incorporates the bi-directional recurrent neural networks and the self-attention mechanism to achieve selective learning of long-range spatial dependencies across genomic loci and context-aware feature representation learning of genomic sequences. The model also employs a hierarchical structure for capturing genomic context information at different scales.

We applied CONCERT to predict RT in mouse embryonic stem cells (mESCs) and human cell types. Using only the genomic sequences as input, CONCERT reaches above 0.90 Pearson correlation coefficients (PCCs) for RT prediction in mESCs and 0.80-0.88 PCCs in seven human cell types. Importantly, the identified predictive genomic elements exhibit strong correlations with specific types of genomic features including repetitive elements and cis-regulatory elements, revealing sequence properties that may regulate RT. In particular, each of the five early replication control elements (ERCEs) in mESCs that were experimentally validated through CRISPR-mediated deletions in a recent study (Sima et al., Cell 2019) can be reliably identified by our method. Furthermore, by applying to multiple human cell types, CONCERT delineated conserved and cell type-specific sequence elements that may play key roles in RT regulation.

Taken together, CONCERT provides a generic interpretable machine learning framework for predicting large-scale genomic profiles based on sequence features and provides new insights into the potential sequence determinants of the DNA replication timing program.

Detecting higher-order structural changes in 3D genome organization with multi-task matrix factorization

Three-dimensional (3D) genome organization, which determines how the DNA is packaged inside the nucleus, has emerged as a key regulatory mechanism of cellular processes. High-throughput chromosomal conformation capture (Hi-C) technologies have enabled the study of 3D genome organization by experimentally measuring interactions among genomic regions in 3D space. Analysis of Hi-C data has revealed higher-order organizational units at multiple resolutions: chromosomal territories, compartments, and topologically associating domains (TADs). Changes or disruptions to such structures have been associated with disease, developmental, and evolutionary processes. Therefore, a key problem is to systematically detect higher-order structural changes across Hi-C datasets from multiple conditions. Existing methods to detect changes in 3D genome organization either do not model higher-order structural units, are specialized to one type of unit (e.g., TADs), or only compare pairs of Hi-C datasets. We address these limitations with Tree-structured Graph-regularized Integrated Factorization (TGIF), a new multi-task Non-negative Matrix Factorization (NMF) approach. TGIF makes use of complex tree-structured relationships among multiple Hi-C datasets such that closely related tasks, one for each Hi-C matrix, have similar lower-dimensional factors. The factors can be further constrained with task-specific graph regularization and are used to extract clusters of genomic regions with dynamically changing interaction profiles across tasks. We applied TGIF to simulated data and in real Hi-C data from cancer cell lines and mouse neural development process. TGIF effectively recovers ground-truth clusters in simulated data even with a large amount of noise and sparsity. When applied to genome-wide Hi-C matrices from karyotypically normal hematopoietic stem and progenitor cells (HSPC) and two chronic myelogenous leukemia (CML) cell lines (K562 and KBM7), TGIF detects the Philadelphia translocation, a large reciprocal translocation between chr9 and chr22 used in the diagnosis of CML. In per-chromosome Hi-C matrices from three cell states during mouse neural development (embryonic stem cell, neural progenitors, and cortical neurons), TGIF is able to identify compartmental switches as well as local TAD shifts accompanying change in nearby gene expression. Taken together, TGIF provides a powerful multi-task framework to study the dynamics and context-specificity of 3D genome organization.

Multinomial CNNs for Mechanistic Modeling of Enhancer Activities

Predicting a sequence’s enhancer activity, defined as the extent to which it changes a gene’s expression, is a fundamental objective of regulatory genomics. We expect an enhancer activity model to reveal at least two pieces of mechanistic information. First, it should identify the putative binding sites for regulatory transcription factors (TFs) in the modeled sequences. Secondly, it should infer the corresponding TFs’ regulatory effects and integrate them to model the sequences’ enhancer activities. Unfortunately, human regulatory genomics still lacks models that meet the above expectations. Convolutional neural networks (CNNs) show high accuracy, and their post hoc analysis reveals position weight matrices that characterize potential TF binding sites in the sequences. However, interpreting the deep architectures in a biophysical manner remains difficult. It is also unclear if a biophysical model could use these post hoc discovered motifs toward similar accuracy. Other models in this realm leverage features, such as k-mers, evolutionary conservation, GC-content, and epigenetic features, that are difficult to relate to TF-DNA binding or TF’s regulatory effect on enhancer activity.

To address this gap, we propose MuSeAM (Multinomial CNNs for Sequence Activity Modeling), a CNN model that learns convolutions as multinomial distributions over the four nucleotides. The multinomial convolutions are directly interpretable (without post hoc analysis) as motifs of TF-DNA binding specificity that have been classically used in biophysical modeling of genomic sequences. Convolving a sequence with these multinomial convolutions gives us likelihood terms, which we use in a linear regression model to fit the sequence’s enhancer activity. The model’s linear coefficients represent the regulatory effects and strength of the TFs, yielding an overall mechanistically interpretable enhancer activity model. MuSeAM is customizable; one can replace linear regression with more mechanistic statistical thermodynamic functions.

We applied MuSeAM on data from a massively parallel reporter assay of human genome sequences in human liver cells. MuSeAM achieved state-of-the-art performance in modeling this data, while also learning multinomial convolutions that recapitulate motifs of TFs known to be active in the human liver, their known regulatory roles, and motif co-occurrence patterns that reflect known TF-TF interactions. The trained MuSeAM model showed high generalization on unseen tasks such as predicting chromatin accessibility and prioritizing functional single-nucleotide polymorphisms (SNPs). The prioritized SNPs are enriched for low minor allele frequencies and occur within promoters and enhancers, highlighting the model’s credibility. We believe the multinomial convolution approach will be transformative in building mechanistically interpretable models of genomic sequences.

Identifying Open Chromatin Regions Associated with Mammalian Phenotypes that Have Evolved through Gene Expression

Many phenotypes, including vocal learning, longevity, and brain size, have evolved at least in part through gene expression, meaning that some of their differences across species are caused by differences in genome sequence at enhancers. While some of the genes involved in these phenotypes have been identified, in most cases, it remains unknown which enhancers are responsible and how genome sequence differences in those enhancers have led to differences in gene expression. We used open chromatin regions (OCRs), which can serve as a proxy for enhancers, from datasets that we and others generated from diverse mammalian brains and livers to train machine learning models that predict brain and liver open chromatin status from genome sequences at orthologs of OCRs. We found that our models could accurately predict lineage-specific and tissue-specific OCR ortholog activity, that our predictions across species tended to match our expectations based on the phylogenetic tree, and that our predictions more accurately predict open chromatin status conservation than predictions made using conservation scores.

We applied our models to make brain and liver OCR ortholog open chromatin status predictions in hundreds of mammals. We found distinct clusters of OCR orthologs based on their predicted open chromatin status in brain and liver. For example, we found clusters of OCR orthologs that are predicted to be closed specific lineages, such as bats or cetaceans, and clusters of OCR orthologs whose open chromatin status seems to have evolved convergently, such as OCR orthologs that are open in brain or liver in only primates and ungulates or only primates and carnivora. We also associated our predictions with cross-species brain and liver phenotype annotations in a way that accounts for phylogeny to identify candidate OCRs that might be involved in these phenotypes. For example, human OCR orthologs associated with longevity are enriched for occurring near genes involved in response to epinephrine and negative regulation of oxidative stress-induced neuron death. Our approach to identifying OCR orthologs associated with phenotypes that have evolved through gene expression can be applied to phenotypes associated with any tissue or cell type that has open chromatin data from at least a few species.