Time series Gene Network Analyses of Explosive Breaching in U.S. Military Warfighters

Background: Injuries from exposure to blast explosions rose dramatically during the Iraq and Afghanistan wars due to increase use of IEDs resulting in blast-related neurotrauma. To investigate the effect of blast on gene regulatory networks, we use timeseries gene expression data from military “breachers” exposed to controlled, low-level blast explosives during training.
Methods: Blood samples were collected from male participants (age 30.2 ± 7.4 years) during a 3-day breacher training: baseline (day1), pre- and-post- breaching (day2), and follow-up (day3) at a U.S. Army training site. Blood samples were obtained and RNA-seq was performed for every time point. RNA-seq read counts were transformed to logCPM by fitting a linear model with the variable of interest (i.e., pre-post breaching operations) and adjusted for subject effect using limma. Potential batch effects was eliminated using surrogate variable analysis. To investigate whether exposure to explosive blast affects gene expression in a coordinated fashion, network analysis was performed via multiscale embedded gene co-expression network analysis (MEGENA) using variably expressed genes (sd >= 0.25) across all subjects/ time. Principal component (PC) analysis was performed on each network using a linear model and fitted on the 1st PC across timepoints for pre-post and pre-follow-up comparisons separately.
Results & Conclusions. To investigate whether prior exposure load including history of traumatic brain injury (TBI) or cumulative career breaching in military service impacts responsivity to an explosion acutely, interaction models including prior exposure load by time (pre-post breaching) analyses were performed. Compared to those with no history of TBI, those with prior TBI history showed blunted acute gene expression responsivity (i.e., no significant alterations) following exposure to blast explosives. Network analyses identified 5 unnested-networks with significant interaction between TBI/Breaching history and time (pre-post/follow-up, p<0.05) that included genes involved in inflammation and adaptive immunity. Semaphorin 7A (a hub gene within one of these networks) is a membrane associated protein that plays an important role in mediating the connection between the CNS and the immune system. Emerging data from both experimental and human studies have shown that a number of Sema loci are involved in response to CNS injury. Although SEMA7A has not been previously implicated in CNS injury, it shows increased expression following blast sub-acutely in subjects with prior history of TBI. These findings show the power of network approaches to identify transcriptional alterations directly in response to exposure to explosives, which is translationally significant in furthering our understanding of blast-related neurotrauma.

A new graph-based clustering method with application to single-cell RNA-seq data from human pancreatic islets

Traditional bulk RNA-sequencing of human pancreatic islets mainly reflects transcriptional response of major cell types. Single-cell RNA sequencing technology enables transcriptional characterization of individual cells, and thus makes it possible to detect cell types and subtypes. To tackle the heterogeneity of single-cell RNA-seq data, powerful and appropriate clustering is required to facilitate the discovery of cell types. In this paper, we propose a new clustering framework based on a graph-based model with various types of distances. We take the compositional nature of single-cell RNA-seq data into account and employ log-ratio transformations. The practical merit of the proposed method is demonstrated through the application to the centered log-ratio transformed single-cell RNA-seq data for human pancreatic islets. The practical merit is also demonstrated through comparisons with existing single-cell clustering methods.

A Novel Deep-Learning Based Approach to Identify, Localize, and Grade Malignant Prostate Lesions

Prostate cancer (PCa) is one of the most widespread and deadly cancers among American men. Currently, patients who are suspected of having PCa are recommended to undergo a biopsy, a mpMRI scan, or a MRI scan. These scans are often not reliable and can have a low detection rate. Furthermore, pathologists almost always have trouble grading the aggressiveness of the prostate tumor. These tumors are classified by the Gleason Grade scale or the ISUP Group Grade scale, which ranks the aggressiveness of the tumor on a scale from 1 to 5 where higher numbers represent a more dangerous tumor. To help identify, segment, and grade prostate tumors, recent studies point to the use of machine learning or artificial intelligence. We present a novel method to both grade and localize prostate tumors in multiparametric magnetic resonance images (mpMRI). Using a convolutional neural network (CNN) to identify mpMRI images that contained the tumors, a Faster region-based convolutional neural network (Faster RCNN) to segment the tumors, and a CNN to grade the tumors by differentiating tumors into high-grade and low-grade tumors, we present a novel method to localize and grade the tumors that performs at least as well as current machine learning techniques while using faster networks and requiring less data. Specifically, we used deep learning to segment the tumors, whereas previous studies had used machine learning, which is more computationally expensive than our method. Our CNN that identified whether mpMRI images had tumors or not had an accuracy of 98.7%. The Faster RCNN used had a sensitivity of 0.972, a false positive rate of 0.263, and an area under curve (AUC) of 0.9007. The CNN used to grade the tumors had an accuracy of 97.6%, which is similar to other machine learning models. In conclusion, these three models provide a novel approach to classifying and grading prostate tumors while requiring less data.

A mathematical model of the molecular mechanism of PZA in Mycobacterium tuberculosis

Pyrazinamide (PZA) is one of the most important drugs used in first and second-line treatments against tuberculosis to eliminate bacteria in latent state, reducing relapse incidence. Pyrazinoic acid (POA) is the active form of PZA, generated by hydrolysis of PZA mediated by the enzyme pyrazinamidase (PZase), encoded by the gene pncA. Its antibiotic effect requires an acidic environment, discovered by the high in vivo and poor in vitro sterilizing activity, potentially due the acidic environment experienced by M. tuberculosis inside granulomes.
The molecular targets of this drug are still unknown, and are thought to be more than one. The current mechanism, based on POA accumulation, states that PZA enters the cell through passive diffusion, is hydrolyzed to POA (mediated by PZase), and expelled to the extracellular environment by efflux pumps. Outside, expelled POA is protonated to HPOA in the acidic environment, and re-enters the bacteria by a membrane potential gradient. In this way, the external acidic environment helps to recover and accumulate intracellular POA, setting up a cycle that leads to a slight acidification of the cytoplasm, which together with the intracellular accumulation of POA act on several potential targets and metabolic pathways. However, cytoplasmic acidification is not strictly required. Previous studies have shown that overexpression of pncA or efflux pump inhibitors eliminate the requirement of an acidic environment for susceptibility in vitro.
To better understand the contribution of an acidic extracellular environment to the internal accumulation of POA, and to study alternative scenarios of resistance/susceptibility, we modelled the PZA/POA metabolic pathway using a system of nonlinear differential equations, fitting experimental parameters. Our results indicate that, in equilibrium (laboratory) conditions, POA accumulation is independent of external pH and only depends on the ratio between the rates of POA production and efflux. In addition, the acidic environment has a significant contribution in the internal accumulation of total-POA (defined as POA+HPOA) only when the ratio efflux-rate/diffusion of external POA and HPOA is greater than 1. Interestingly, the rate of POA production could help to increase the total-POA accumulation independently of the POA efflux rate and external pH. In addition, although low POA efflux rates accumulate more internal total-POA than an acidic environment, the cytoplasm will not be acidified, suggesting that a reduction of internal pH is not sufficient to produce the bactericidal effect. Finally, we performed simulations to evaluate possible consequences in growth rate under different PZA concentrations.

scFEA: A graph neural network model to estimate cell-wise metabolic using single cell RNA-seq data

The metabolic heterogeneity, and metabolic interplay between cells and their microenvironment have been known as significant contributors to disease treatment resistance. Our understanding of the intra-tissue metabolic heterogeneity and cooperation phenomena among cell populations is unfortunately quite limited, without a mature single cell metabolomics technology. To mitigate this knowledge gap, we developed a novel computational method, namely scFEA (single cell Flux Estimation Analysis), to infer single cell fluxome from single cell RNA-sequencing (scRNA-seq) data. scFEA is empowered by a comprehensively reorganized human metabolic map as focused metabolic modules, a novel probabilistic model to leverage the flux balance constraints on scRNA-seq data, and a novel graph neural network based optimization solver. The intricate information cascade from transcriptome to metabolome was captured using multi-layer neural networks to fully capitulate the non-linear dependency between enzymatic gene expressions and reaction rates. We experimentally validated scFEA by generating an scRNA-seq dataset with matched metabolomics data on cells of perturbed oxygen and genetic conditions. Application of scFEA on this dataset demonstrated the consistency between predicted flux and metabolic imbalance with the observed variation of metabolites in the matched metabolomics data. We also applied scFEA on publicly available single cell melanoma and head and neck cancer datasets, and discovered different metabolic landscapes between cancer and stromal cells. The cell-wise fluxome predicted by scFEA empowers a series of downstream analysis including identification of metabolic modules or cell groups that share common metabolic variations, sensitivity evaluation of enzymes with regards to their impact on the whole metabolic flux, and inference of cell-tissue and cell-cell metabolic communications.

A molecular taxonomy of tumors independent of tissue-of-origin

Cancer is an organism-level disease, impacting processes from cellular metabolism and the microenvironment to systemic immune response. Nevertheless, efforts to distinguish overarching mutational processes from interactions with the cell of origin for a tumor have seen limited success, presenting a barrier to individualized medicine. Here we present a novel, pathway-centric approach, extracting somatic mutational profiles within and between tissues, largely orthogonal to cell of origin, mutational burden, or stage. Known predisposition variants are equally distributed among clusters, and largely independent of molecular subtype. Prognosis and risk of death vary jointly by cancer type and cluster. Analysis of metastatic tumors reveals that differences are largely cluster-specific and complementary, implicating convergent mechanisms that combine familiar driver genes with diverse low-frequency lesions in tumor-promoting pathways, ultimately producing distinct molecular phenotypes. The results shed new light on the interplay between organism-level dysfunction and tissue-specific lesions.

Identification of 3'UTR Motif for Variant-Containing Genes Associated with Schizophrenia

Schizophrenia is a neurological disorder that affects behavior and emotions and can result in serious hallucinations and motor problems. Around 3.5 million Americans have been diagnosed with schizophrenia. the underlying disease mechanism of schizophrenia is still unknown.
We downloaded genome-wide association study (GWAS) results for schizophrenia from the Phenotype-Genotype Integrator (PheGenI). Using Enrichr, we inputted 1,453 genes associated with schizophrenia and discovered a pathway connection between the disease and ARCHS4 co-expression. We obtained 9 missense variant-containing genes and 14 3’UTR variant-containing genes associated with schizophrenia. Out of the total number of studied variants stated above, 3 missense variant-containing genes and 4 3’UTR variant-containing genes were reported to have ClinVar annotation significance (pathogenic/likely pathogenic). Protein-protein interaction (PPI) results from STRING Database indicate no evidence showing interacted genes overlapping across the 9 missense variant-containing genes. UCSC Genome Browser, we retrieved 3’UTR sequences for all 14 genes to run MEME software. We identified one promising motif present in all 14 3’UTR variant-containing genes. Currently, we are still investigating whether any of the 3’UTR variants are also contributing to functional importance, in the context of the discovered motif.

Identifying genetic determinants of transcriptional response dynamics at single cell resolution

Single cell technologies enabled gene expression analysis at single cell resolution to study heterogeneous cell populations across conditions and individuals; yet few computational methods exist to profile response dynamics. Beyond looking at the transcriptional response to stimuli for each cell-type separately, single cell data can capture more fine-grained details by analyzing trajectory dynamics, variance, other higher order moments or velocity. These response dynamics are crucial to gain a better understanding of the molecular underpinnings in inter-individual variation in drug response. Here we collected new data and developed a new supervised approach based on the linear discriminant analysis (LDA), to construct a low dimensional representation of the response for each cell-type. We then used the LDA axes to identify gene-by-environment interactions in the response dynamics to different stimuli for each cell-type. To evaluate the new method and to also compare to other dynamic metrics (dispersion and velocity), we activated peripheral blood mononuclear cells from 96 African Americans with phytohemagglutinin (PHA) or lipopolysaccharide (LPS), and treated with the glucocorticoid dexamethasone (DEX).
We performed scRNA-seq for 292,394 cells and identified four major cell types: B-cell, Monocyte, NK-cell and T-cell. We employed negative binomial distribution to calculate RNA expression dispersion (which is less correlated with mean expression than variance). We detected 2,585 genes with variable dispersion, most in monocytes and enriched in immune-related processes. Effects on dispersion induced by PHA and LPS are negatively correlated with those induced by DEX, which implies that DEX suppresses the activated immune response through effects on both gene expression mean and dispersion. Using scVelo we quantified and determined 1,706 genes with differential gene expression velocities between treatments (34% of them with variable dispersion). Our new approach (LDA) is able to capture gene expression dynamic changes in response to treatments in two components: LDA1 captures dynamic changes induced by DEX, while LDA2 tends to reflect dynamic changes induced by PHA and LPS. For dynamic interaction eQTL mapping, we identified 834 eQTLs interacting with LDA1 (39 genes) in the DEX groups, and 12,512 eQTLs interacting with LDA2 (2,531 genes) from PHA or LPS groups. We also discovered 130 dispersion eQTLs corresponding to 19 unique genes. Our results shed light on the dynamics of gene expression in response to stimuli and across individuals. These dynamic processes fundamentally regulate the immune system and may contribute to inter-individual variation in immunological processes and diseases.

Single Intestinal Epithelial Cell Transcriptomic Atlas of Treatment Naïve Patients with Crohn’s Disease reveals Altered Cellular Composition and Gene Expression

Crohn’s disease (CD) is a condition caused by an abnormal immune response to enteric microbiota in the gastrointestinal tract of a genetically susceptible host (1). Due to the high heterogeneity in disease location, behavior, and progression, precise clinical diagnosis of CD is challenging2. Furthermore, there are no reliable methods to aid prognosis, and there are no long-term treatments for CD, necessitating a better understanding of CD at a molecular level (2).

A key contributing factor to this chronic inflammatory condition is intestinal epithelial cell (IEC) barrier breakdown. IECs separate microbes in the lumen from the mucosal immune system, and multiple IEC subtypes compose the intestinal epithelium with distinct functions (3). In this study we generated single-cell RNA sequencing data from colonic IECs of 3 treatment naïve newly diagnosed adult CD and 4 healthy control (NIBD) patients to understand which IEC subtypes drive changes in gene expression and IEC function that lead to dysregulation and disease.

Previous studies have looked at single-cell transcriptomics of healthy colonic and ileal samples as well as those with ulcerative colitis and CD (4-6), but our study is the first to focus on the single-cell composition of colonic IECs from CD patients. We found cell types that were not identified previously, such as CEACAM7+ colonocytes, and confirmed previous findings, such as the existence of high BEST4 signal in an enterocyte cell subtype. Cellular composition was generally consistent between NIBD and CD samples, though CD patients had significantly more CA1+ late colonocytes and reduced numbers of stem cells. Furthermore, differences between CD and NIBD patients included a higher ratio of goblet cells to enteroendocrine cells in CD. We also see differences between CD and NIBD samples in expression levels of genes within each IEC subtype, and KEGG pathway analysis showed that genes upregulated in CD patients were overlapped with gene sets implicated in autoimmune diseases and infections (viral or bacterial). Overall, these differences may reflect altered IEC function in disease.

1. Sartor R.B. 2006. Nat Clin Pract Gastroenterol Hepatol 3(7):390-407.
2. Furey, T.S., Sethupathy, P., Sheikh, S.Z. 2019. Nat Rev Gastroenterol Hepatol 16, 296–311.
3. Peterson, L., Artis, D. 2014. Nat Rev Immunol 14, 141–153.
4. Haber, A. et al. 2017. Nature 551, 333–339.
5. Smillie, C.S. et al. 2019. Cell 178(3):714-730.e22.
6. Parikh, K. et al. 2019. Nature 567, 49–55.

Benchmarking Reverse-Complement Strategies for Deep Learning Models in Genomics

Predictive models that map double-stranded regulatory DNA to molecular signals of regulatory activity should, in principle, produce identical predictions regardless of whether the sequence of the forward strand or its reverse complement (RC) is supplied as input. Unfortunately, standard convolutional neural network architectures can produce highly divergent predictions across strands, even when the training set is augmented with RC sequences. Two strategies have emerged in the literature to enforce this symmetry: conjoined a.k.a. "siamese" architectures where the model is run in parallel on both strands & predictions are combined, and RC parameter sharing or RCPS where weight sharing ensures that the response of the model is equivariant across strands. However, systematic benchmarks are lacking, and neither architecture has been adapted to base-resolution signal profile prediction tasks. In this work, we extend conjoined and RCPS models to signal profile prediction, and introduce a strong baseline: a standard model (trained on RC augmented data) that is converted to a conjoined model only after it has been trained, which we call a "post-hoc" conjoined model. We then conduct benchmarks on both binary and signal profile prediction. We find post-hoc conjoined models consistently perform as well as or better than models that were conjoined during training, and present a mathematical intuition for why. We also find that - despite its theoretical appeal - RCPS performs surprisingly poorly on certain tasks, in particular, signal profile prediction. In fact, RCPS can sometimes do worse than even standard models trained with RC data augmentation. We prove that the RCPS models can represent the solution learned by the conjoined models, implying that the poor performance of RCPS may be due to optimization difficulties. We therefore suggest that users interested in RC symmetry should default to post-hoc conjoined models as a reliable baseline before exploring RCPS.