Review

Deep learning approaches for integrative genomic and epigenomic data analysis

Heikham Russiachand Singh 1*

1 Department of Plant Science, McGill University, Raymond Building, 21111, Lakeshore Road, Ste. Anne de Bellevue, Quebec, Canada.

* Correspondence: heikham.singh@mcgill.ca (H.R.S.)


Citation: Singh, H.R. Deep learning approaches for integrative genomic and epigenomic data analysis. Glob. Jour. Bas. Sci. 2025, 2(1), 1-9.

Received: August 16, 2025

Revised: October 13, 2025

Accepted: November 04, 2025

Published: November 06, 2025

doi: 10.63454/jbs20000067

ISSN: 3049-3315

Volume 2; Issue 1



Abstract: Advances in high-throughput sequencing technologies have enabled the comprehensive profiling of genomic and epigenomic landscapes across diverse biological systems. However, the massive scale, heterogeneity, and complexity of multi-omics datasets pose significant computational challenges. Deep learning has emerged as a transformative approach for extracting meaningful patterns from large-scale genomic and epigenomic data, enabling improved prediction of gene regulation, chromatin architecture, disease associations, and functional genomic elements. This review provides a comprehensive overview of deep learning methodologies applied to integrative genomic and epigenomic data analysis, including convolutional neural networks, recurrent neural networks, transformers, graph neural networks, and autoencoders. We discuss applications in regulatory element prediction, chromatin state modeling, 3D genome organization, disease genomics, and personalized medicine. Challenges, interpretability issues, and future perspectives for deep learning-driven genomics are also highlighted.

Keywords: Deep learning; genomics; epigenomics; multi-omics integration; machine learning; chromatin architecture; precision medicine

1. Introduction

The genomic sequence represents the canonical blueprint of biological systems, a static yet intricate code composed of billions of nucleotide base pairs that encodes the fundamental instructions for life. This linear sequence holds the potential for the vast diversity of cell types and functions observed in complex organisms. However, a critical paradox emerges: every nucleated cell within an organism possesses an essentially identical genome, yet they exhibit extraordinary phenotypic diversity, giving rise to neurons, muscle cells, hepatocytes, and immune cells with distinct morphologies and functions. This profound disparity between genetic uniformity and cellular heterogeneity cannot be resolved by the DNA sequence alone. It is orchestrated by a sophisticated regulatory overlay known as the epigenome—a dynamic and heritable collection of chemical modifications and structural adaptations that modulate genomic function without altering the primary nucleotide sequence [1-3]. 

The epigenome functions as the genome’s operating system, determining which segments of code are executed, silenced, or modulated in response to developmental cues and environmental stimuli. Its principal mechanisms are multifaceted. DNA methylation, the addition of a methyl group to cytosine bases, primarily within CpG dinucleotides, is a classic repressive mark associated with stable gene silencing, genomic imprinting, and X-chromosome inactivation. In contrast, histone modifications—including acetylation, methylation, phosphorylation, and ubiquitination—alter the electrostatic charge and structural configuration of chromatin, the complex of DNA and proteins. For instance, histone acetylation generally relaxes chromatin to promote transcription, while specific methylation patterns (e.g., H3K9me3, H3K27me3) can establish facultative heterochromatin [2]. The cumulative effect of these marks dictates chromatin accessibility, defining regions of open, transcriptionally permissive euchromatin versus compact, silent heterochromatin. Furthermore, the genome is not organized linearly within the nucleus but is folded into a precise three-dimensional (3D) architecture. This spatial organization, governed by loop formations, topologically associating domains (TADs), and compartmentalization, brings distal regulatory elements like enhancers into physical proximity with their target gene promoters, a necessity for precise transcriptional control [3]. Together, these epigenetic layers form an integrated regulatory network that dictates cellular identity, plasticity, and homeostasis. 

The revolution in high-throughput sequencing technologies has enabled the systematic mapping of these genomic and epigenomic features at an unprecedented scale and resolution. Initiatives like the Encyclopedia of DNA Elements (ENCODE) and the International Human Epigenome Consortium (IHEC) have generated terabytes of data, including whole-genome sequences, DNA methylomes, chromatin accessibility maps (ATAC-seq, DNase-seq), histone modification profiles (ChIP-seq), and 3D interaction matrices (Hi-C) [4]. This deluge of multi-modal data presents both an opportunity and a formidable analytical challenge. Traditional bioinformatics approaches and classical machine learning models (e.g., support vector machines, random forests) have provided foundational insights but are often limited. They typically require manual feature engineering, struggle with the high dimensionality and inherent noise of omics data, and are poorly equipped to model the non-linear interactions, long-range genomic dependencies, and hierarchical patterns that characterize biological regulation [5]. 

In this context, deep learning (DL) (Figure 1), a transformative subfield of artificial intelligence, has emerged as a paradigm-shifting analytical framework. Inspired by the structure and function of biological neural networks, DL employs multi-layered (deep) artificial neural networks capable of automatic feature representation learning. Unlike traditional methods, DL models can ingest raw or minimally processed data—such as nucleotide sequences or signal coverage tracks—and autonomously learn hierarchical abstractions, from simple motifs to complex regulatory grammars, directly from the data itself [6]. This capability has driven breakthroughs in fields once thought impervious to automation, such as computer vision and natural language processing (NLP). Notably, the analytical tasks in these fields share conceptual parallels with genomics: interpreting spatial patterns in images is analogous to recognizing sequence motifs and chromatin patterns, while modeling syntax and semantics in language mirrors understanding the regulatory grammar of the genome.  Consequently, the integration of deep learning with multi-omics data is forging a new frontier in computational biology. By simultaneously analyzing heterogeneous datasets—genomic sequence, epigenetic marks, and 3D conformation—DL models offer an unprecedented opportunity to construct unified, predictive models of genome regulation. This integrative approach moves beyond correlation to infer causality, enabling researchers to decode the regulatory logic of cells, predict the functional impact of non-coding genetic variants, unravel novel disease mechanisms rooted in epigenetic dysregulation, and propel systems biology into a new era of mechanistic and predictive power [7]. 
This review examines the architectures of deep learning models tailored for genomics, their application in integrating and interpreting multi-omics data, the transformative insights they have yielded, and the challenges and future directions of this rapidly evolving synergy.

2. Genomic and epigenomic data types

2.1 Genomic data

Genomic data provide the fundamental alphabetic sequence of an organism’s DNA, serving as the primary template from which all biological instructions are derived. The advent of next-generation sequencing (NGS) technologies has transformed this field, enabling rapid, cost-effective generation of comprehensive datasets that capture both the consensus sequence and its individual variations. Whole-genome sequencing (WGS) delivers a complete, high-fidelity readout of an individual’s entire DNA complement, typically comprising over 3 billion base pairs in humans. The computational analysis of these sequences focuses on identifying variations against a reference genome. These variations include single nucleotide polymorphisms (SNPs), which are single-base changes present in a population; small insertions and deletions (indels); copy number variations (CNVs), which are duplications or deletions of genomic segments ranging from kilobases to megabases; and complex structural variants (SVs) such as inversions, translocations, and complex rearrangements that can dramatically alter genomic architecture [5]. For more targeted and efficient analysis, exome sequencing enriches and sequences only the protein-coding regions (exons), which constitute about 1-2% of the genome but harbor a majority of known pathogenic variants for Mendelian disorders. Collectively, these genomic datasets form the indispensable, albeit static, foundation for human genetics. They enable population-scale studies of genetic predisposition to complex traits and diseases, the identification of rare, high-impact variants in monogenic disorders, and the cataloging of somatic mutations that accumulate in cancer genomes, driving tumorigenesis and progression [6]. While essential, this one-dimensional sequence view is insufficient to explain phenotypic complexity, necessitating the integration of dynamic regulatory information.
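The logic of comparing a sample against a reference can be illustrated with a deliberately simple sketch that reports single-base mismatches between two pre-aligned sequences. All names here are hypothetical; real variant callers operate on billions of reads and add alignment, quality filtering, and statistical genotyping models:

```python
def call_snps(reference: str, sample: str):
    """Report single-nucleotide differences between an aligned sample
    sequence and the reference (a toy stand-in for SNP calling)."""
    assert len(reference) == len(sample), "sequences must be aligned"
    return [(pos, ref_base, alt_base)
            for pos, (ref_base, alt_base) in enumerate(zip(reference, sample))
            if ref_base != alt_base]

# Two SNPs relative to the reference, reported as (position, ref, alt)
variants = call_snps("ACGTACGT", "ACGAACGC")
print(variants)  # [(3, 'T', 'A'), (7, 'T', 'C')]
```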

2.2 Epigenomic data

Epigenomic data capture the dynamic and cell-type-specific regulatory annotations superimposed on the static DNA sequence, effectively charting the “active” regions of the genome and their spatial relationships. A sophisticated suite of assays, often coupled with NGS, generates high-resolution maps of these features. Chromatin Immunoprecipitation followed by sequencing (ChIP-seq) is a cornerstone technique that uses antibodies to immunoprecipitate DNA bound by specific proteins. It is used to map the genome-wide occupancy of transcription factors, co-activators, or RNA polymerases, and to profile the location of specific histone modifications. For instance, trimethylation of histone H3 at lysine 4 (H3K4me3) marks active promoters, while acetylation of H3 at lysine 27 (H3K27ac) is a hallmark of active enhancers and promoters [7]. To profile the chromatin landscape more broadly, assays like DNase I hypersensitivity sequencing (DNase-seq) and Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq) identify regions of open, nucleosome-depleted chromatin. These accessible regions are universally indicative of cis-regulatory elements, including promoters, enhancers, and insulators, providing a snapshot of the genome’s regulatory potential in a given cell state [8].

Beyond accessibility and protein binding, the covalent modification of DNA itself is a key epigenetic mark. Bisulfite sequencing treats DNA with sodium bisulfite, which converts unmethylated cytosines to uracil (read as thymine during sequencing) while leaving methylated cytosines unchanged. This allows for base-pair-resolution mapping of DNA methylation, predominantly at CpG dinucleotides. Dense methylation in gene promoters is typically associated with stable transcriptional repression, playing critical roles in X-chromosome inactivation, genomic imprinting, and silencing of transposable elements [9].
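The inference step behind bisulfite sequencing can be sketched in a few lines: at each reference cytosine, a retained "C" in the converted read implies methylation (the base was protected from conversion), while a "T" implies an unmethylated cytosine. This is a minimal single-read sketch with hypothetical names; real pipelines aggregate many reads per position and handle strand and context:

```python
def call_methylation(reference: str, bisulfite_read: str):
    """At each reference cytosine, a 'C' in the bisulfite-converted read
    implies methylation; a 'T' implies an unmethylated cytosine that was
    converted to uracil and read as thymine."""
    calls = {}
    for pos, (ref_base, read_base) in enumerate(zip(reference, bisulfite_read)):
        if ref_base == "C":
            calls[pos] = "methylated" if read_base == "C" else "unmethylated"
    return calls

# The cytosine at position 1 survived conversion; the one at position 5 did not
print(call_methylation("ACGTACGT", "ACGTATGT"))
```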

Perhaps the most structurally informative data type comes from techniques that probe the three-dimensional (3D) architecture of the genome. Hi-C and its derivatives are high-throughput chromosome conformation capture methods that chemically crosslink spatially proximal DNA segments, then sequence them to generate a genome-wide interaction matrix. This data reveals higher-order organizational features such as chromatin loops that connect enhancers to promoters, topologically associating domains (TADs) which are self-interacting genomic neighborhoods that constrain regulatory interactions, and larger compartmentalization into active (A) and inactive (B) regions [10]. Collectively, these complementary epigenomic datasets—capturing protein binding, chromatin accessibility, DNA modification, and 3D folding—illuminate the complex, multi-layered regulatory landscape that dynamically controls the precise spatiotemporal patterns of gene expression.
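The central data structure of Hi-C analysis, the binned contact matrix, can be sketched as follows. The coordinates and bin size are toy values for illustration; real matrices span billions of base pairs at kilobase resolution:

```python
def contact_matrix(pairs, genome_length, bin_size):
    """Aggregate pairwise contacts into a symmetric binned matrix, as in
    Hi-C: entry [i][j] counts ligation events between bins i and j."""
    n_bins = -(-genome_length // bin_size)   # ceiling division
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for a, b in pairs:
        i, j = a // bin_size, b // bin_size
        matrix[i][j] += 1
        if i != j:
            matrix[j][i] += 1                # keep the matrix symmetric
    return matrix

# Three contacts on a 400-bp toy genome with 100-bp bins: two long-range
# contacts between bins 0 and 2, and one local contact within bin 1
m = contact_matrix([(10, 250), (30, 260), (120, 130)], 400, 100)
```

Loops and TAD boundaries then appear as enriched entries and block structure in this matrix.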

Figure 1. Deep learning approaches for integrative genomic and epigenomic data analysis. This schematic diagram illustrates a comprehensive computational framework for integrating genomic and epigenomic datasets using deep learning models. The left panel depicts primary genomic inputs, including sparse DNA sequence data and high-throughput sequencing outputs such as whole-genome sequencing and transcriptomic profiles. The right panel highlights epigenomic datasets, including histone modification profiles, DNA methylation patterns, chromatin accessibility data (e.g., ATAC-seq), chromatin immunoprecipitation sequencing (ChIP-seq), and three-dimensional chromatin interaction data (Hi-C), representing multiple regulatory layers of genome function.  At the center, deep learning architectures—including convolutional neural networks (CNNs), recurrent neural networks (RNNs)/long short-term memory (LSTM) models, transformers, and autoencoders—are shown as the core analytical engines that learn hierarchical and nonlinear representations from multi-omics data. These models integrate heterogeneous inputs to capture complex regulatory patterns, sequence features, and epigenomic signatures.  The bottom panel summarizes key downstream applications of integrative deep learning in genomics. These include prediction of regulatory elements such as enhancers, promoters, silencers, and insulators; chromatin state modeling and epigenetic landscape reconstruction; inference of three-dimensional genome organization and chromatin looping interactions; and translational applications in disease genomics and precision medicine, such as biomarker discovery and therapeutic target identification.  Overall, the figure highlights the workflow from raw genomic and epigenomic data acquisition to deep learning–based integration and biological interpretation, emphasizing the role of artificial intelligence in decoding complex regulatory mechanisms underlying gene expression and disease phenotypes.

3. Deep learning architectures for genomics and epigenomics

3.1 Convolutional Neural Networks (CNNs)

Convolutional Neural Networks (CNNs), originally designed for image recognition, have become a foundational architecture in genomic deep learning due to their exceptional ability to detect local, translation-invariant patterns. Inspired by visual processing, CNNs apply a series of learnable filters (or kernels) that convolve across the input data—whether a one-hot encoded DNA sequence (A, C, G, T represented in four channels) or a continuous signal track from an epigenomic assay (Figure 1). Each filter acts as a motif detector, scanning the sequence to identify specific short, recurring patterns. Through successive convolutional and pooling layers, the network builds a hierarchical representation, learning progressively more complex features from simple k-mers to larger combinatorial regulatory grammars [11]. This mirrors how the cell’s machinery recognizes sequence motifs. The pioneering DeepBind model demonstrated that CNNs could predict the sequence specificity of DNA- and RNA-binding proteins directly from raw sequence, outperforming traditional position weight matrix methods [12]. Subsequent models significantly expanded this capability. DeepSEA used a multi-task CNN to predict the chromatin effects (e.g., transcription factor binding, histone marks) of DNA sequence and assess the impact of non-coding variants. DanQ introduced a hybrid architecture, using a CNN to extract local motifs and a recurrent layer to capture the long-range context of these motifs, achieving state-of-the-art performance in predicting chromatin accessibility and transcription factor binding sites [13]. These successes established CNNs as powerful tools for de novo motif discovery and for modeling the local sequence determinants of regulatory function.
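The motif-detector intuition can be made concrete with a minimal sketch: a one-hot encoded sequence scanned by a single convolutional filter whose weights happen to match a motif. This is a hand-rolled illustration, not any published model's code; real CNNs learn their filter weights from data and stack many filters and layers:

```python
# One-hot encode a DNA sequence into four channels, then scan it with a
# single convolutional filter whose weights encode a motif ("TATA" here)
CHANNELS = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq):
    return [[1.0 if CHANNELS[base] == ch else 0.0 for ch in range(4)]
            for base in seq]

def conv_scan(encoded, kernel):
    """Slide the kernel along the sequence; each output position is the
    dot product of the kernel with a window of the input (no padding)."""
    k = len(kernel)
    return [sum(encoded[i + j][ch] * kernel[j][ch]
                for j in range(k) for ch in range(4))
            for i in range(len(encoded) - k + 1)]

# A filter that fires maximally on an exact TATA match
tata_kernel = one_hot("TATA")
scores = conv_scan(one_hot("GGTATAGG"), tata_kernel)
print(scores)  # [2.0, 0.0, 4.0, 0.0, 2.0] — peak where the motif starts
```

A pooling layer would then reduce this activation profile (e.g., to its maximum), giving the translation invariance described above.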

3.2 Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM)

While CNNs excel at capturing local spatial patterns, genomic regulation often involves long-range dependencies where an enhancer located tens of kilobases away influences a promoter’s activity. Recurrent Neural Networks (RNNs) are a class of neural networks explicitly designed for sequential data. They process inputs (e.g., nucleotides) one step at a time, maintaining a hidden state vector that acts as a memory of all previous elements in the sequence. This allows them, in theory, to model context and dependencies across the entire length of the input. In practice, basic RNNs suffer from the vanishing gradient problem, making it difficult to learn long-term dependencies. The Long Short-Term Memory (LSTM) network, a specialized RNN variant, addresses this with a more complex cell structure containing input, forget, and output gates that regulate the flow of information, enabling it to retain relevant information over much longer sequences [14].
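A single-unit vanilla RNN makes the memory idea, and its limitation, concrete: the hidden state carries information forward, but the influence of an early input decays at every step, which is the intuition behind the vanishing-gradient problem that LSTM gating mitigates. A minimal sketch with illustrative (not learned) weights:

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """A one-unit vanilla RNN: at each step the hidden state mixes the
    current input with a decayed memory of everything seen so far."""
    h = 0.0
    states = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h + b)  # recurrent update
        states.append(h)
    return states

# A pulse at the first step: its trace shrinks with every subsequent step
states = rnn_forward([1.0, 0.0, 0.0, 0.0])
```

Because the signal is repeatedly squashed through the recurrent weight and tanh, its gradient shrinks the same way, motivating the LSTM's forget gate, which can hold information unattenuated.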

In genomics, RNNs/LSTMs are particularly useful for tasks where the order and context of features are paramount. They have been applied to model DNA as a biological “language,” predicting splice sites, coding potential, and protein-binding residues. A powerful strategy involves hybrid CNN-RNN architectures. In such models, a CNN first processes the raw sequence to extract a rich set of local, position-invariant features (motifs). These feature maps are then fed into an RNN layer (often an LSTM or bidirectional LSTM) that processes them sequentially, learning the spatial relationships and dependencies between the detected motifs. This architecture has proven highly effective for tasks like predicting gene expression levels from extended promoter/enhancer sequences and inferring chromatin states across large genomic windows, as it captures both the “words” (motifs) and the “syntax” (their spatial arrangement) of the genomic code [15].

3.3 Transformers

The Transformer architecture, which revolutionized natural language processing (NLP), has brought a paradigm shift to genomic deep learning. Its core innovation is the self-attention mechanism, which computes a weighted sum of representations for all positions in a sequence when encoding any given position. This allows the model to directly model relationships between any two nucleotides in the input, regardless of distance, overcoming the sequential bottleneck and limited effective range of RNNs [16]. For genomics, this is a game-changer, as it enables efficient modeling of the interactions between distal regulatory elements (e.g., enhancers) and promoters within a single, cohesive model.
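The self-attention computation itself is compact. The sketch below implements scaled dot-product attention with identity query/key/value projections, a simplification since real transformers use learned projection matrices and multiple heads, to show that identical positions attend strongly to each other regardless of the distance separating them:

```python
import math

def softmax(xs):
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Scaled dot-product self-attention with identity Q/K/V projections:
    every position attends directly to every other position."""
    d = len(X[0])
    out = []
    for q in X:                                    # one query per position
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in X]                      # similarity to every key
        weights = softmax(scores)                  # attention distribution
        out.append([sum(w * v[j] for w, v in zip(weights, X))
                    for j in range(d)])            # weighted sum of values
    return out

# Positions 0 and 2 carry the same embedding, so each attends strongly
# to the other despite the intervening position
Y = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```

Because every pair of positions is compared, the cost grows quadratically with sequence length, which is one reason long-range genomic models such as Enformer combine attention with convolutional down-sampling of the input.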

Leveraging this power, models like DNABERT adopt a pre-training strategy analogous to BERT in NLP. They are trained on massive corpora of DNA sequences using self-supervised objectives (e.g., masked language modeling), learning rich, context-aware representations of genomic sequences. These pre-trained models can then be efficiently fine-tuned on specific downstream tasks with limited labeled data, such as predicting promoter regions or transcription factor binding sites [17]. The landmark Enformer model exemplifies the transformer’s potential for integrative prediction. Taking a DNA sequence of up to 200 kilobases as input, Enformer uses a transformer-based architecture with attention across the entire span to simultaneously predict thousands of experimental tracks, including gene expression (CAGE) and histone modifications, directly from sequence. It accurately captures the quantitative effects of distal enhancers and can predict the consequences of sequence variants, demonstrating an unprecedented ability to model the cis-regulatory code of entire loci [18].

3.4 Autoencoders and Variational Autoencoders (VAEs)

Autoencoders are neural networks designed for unsupervised learning and dimensionality reduction. They consist of an encoder network that compresses high-dimensional input data (e.g., a vector of gene expression values across 20,000 genes) into a low-dimensional latent representation (or code), and a decoder network that attempts to reconstruct the original input from this code. By training to minimize reconstruction error, the autoencoder is forced to learn the most salient and informative features of the data, effectively discarding noise [19].

Variational Autoencoders (VAEs) introduce a probabilistic framework to this architecture. Instead of learning a single latent code for an input, the encoder learns the parameters (mean and variance) of a probability distribution (typically Gaussian) in the latent space. A sample is then drawn from this distribution and passed to the decoder. This stochastic process ensures the latent space is continuous and structured, allowing for meaningful interpolation and the generation of new, realistic data points by sampling from the prior distribution. In the context of multi-omics, VAEs are powerful tools for integrative analysis. Different omics data types (genomic variants, DNA methylation, gene expression) from the same sample can be encoded into a shared, low-dimensional latent space. This unified representation captures the underlying biological state of the sample, facilitating tasks like identifying coherent patient subtypes that span multiple data layers, imputing missing data modalities, and visualizing complex datasets [20].
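The defining trick of the VAE, sampling the latent code as z = mu + sigma * eps so that gradients can flow through mu and sigma, can be sketched as follows. The toy encoder is a hypothetical stand-in for a trained network:

```python
import math
import random

def encode(x):
    """Toy encoder: map an input vector to the mean and log-variance of
    a 1-d Gaussian in latent space (a real encoder is a neural network)."""
    mu = 0.1 * sum(x)
    log_var = -2.0
    return mu, log_var

def sample_latent(mu, log_var, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1),
    keeping the sampling step differentiable w.r.t. mu and sigma."""
    sigma = math.exp(0.5 * log_var)
    return mu + sigma * rng.gauss(0.0, 1.0)

rng = random.Random(42)
mu, log_var = encode([1.0, 2.0, 3.0])   # mu = 0.6 for this toy input
draws = [sample_latent(mu, log_var, rng) for _ in range(2000)]
```

Training would add a reconstruction loss on the decoder output plus a KL-divergence term pulling each posterior toward the prior, which is what gives the latent space its continuous, structured geometry.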

3.5 Graph Neural Networks (GNNs)

Biological knowledge and data are inherently relational, forming complex networks—genes regulate each other, proteins interact, and genomic loci contact one another in 3D space. Graph Neural Networks (GNNs) are a family of deep learning models specifically engineered to operate on graph-structured data. A graph is defined by a set of nodes (entities, e.g., genes, genomic bins) and edges (relationships, e.g., regulatory interactions, physical proximity) [21]. The fundamental operation of a GNN is message passing: each node aggregates feature information from its neighboring nodes, updates its own representation based on this aggregated context, and this process is repeated over several layers. This allows nodes to incorporate information from their local network neighborhood.
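One round of message passing can be sketched with mean aggregation over scalar node features (real GNNs use learned weight matrices and vector features; the names here are illustrative). On a path graph 0–1–2, a signal at node 0 reaches node 2 only after two rounds, showing how stacked layers widen each node's receptive field over the graph:

```python
def message_pass(features, adjacency):
    """One round of mean-aggregation message passing: each node's new
    feature averages its own feature with those of its neighbors."""
    new_features = []
    for node, feat in enumerate(features):
        neighborhood = [feat] + [features[nbr] for nbr in adjacency[node]]
        new_features.append(sum(neighborhood) / len(neighborhood))
    return new_features

# A path graph 0-1-2 with a scalar feature only on node 0
adjacency = {0: [1], 1: [0, 2], 2: [1]}
feats = [1.0, 0.0, 0.0]
step1 = message_pass(feats, adjacency)   # node 1 picks up node 0's signal
step2 = message_pass(step1, adjacency)   # by round two it reaches node 2
```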

This architecture is exquisitely suited for analyzing 3D genome organization data. A Hi-C contact map can be treated as a graph where nodes are genomic bins and weighted edges represent interaction frequencies. GNNs can take this graph, along with node features like sequence or epigenetic marks, and perform tasks such as predicting the strength of unobserved chromatin loops, classifying genomic loci as TAD boundaries, or predicting how a structural variant might rewire the 3D interactome [22]. Beyond 3D genomics, GNNs can model gene regulatory networks (with genes as nodes and regulatory edges), protein-protein interaction networks, and single-cell data (where cells are nodes and edges are based on similarity). This enables truly integrative analysis that combines the rich feature information of nodes (from genomics/epigenomics) with the topological structure of their interactions.

4. Integrative multi-omics data analysis

A central challenge and opportunity in modern biology is the integration of multi-omics data—genomics, epigenomics, transcriptomics, proteomics—to construct a unified, systems-level understanding of cellular state and disease mechanisms. Deep learning provides uniquely flexible frameworks for this integration, moving beyond simple concatenation of data vectors.

One powerful approach is the design of multi-modal (or multi-view) neural network architectures. These models are constructed with separate, specialized input branches, or “towers,” for each data modality. For example, one branch might be a CNN for processing DNA sequence, another a CNN for ATAC-seq signal tracks, and a third for RNA-seq expression vectors. Each branch learns features tailored to its data type. Their outputs are then fused in a shared, deeper layer of the network—through concatenation, summation, or more complex attention-based mechanisms—where the model learns the non-linear relationships between modalities to make a joint prediction (e.g., disease subtype or drug response) [23]. This allows the model to leverage complementary information; the sequence might provide causal variants, the chromatin accessibility their regulatory context, and the transcriptome the functional outcome.
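The tower-and-fusion pattern can be sketched with single linear units standing in for full branches; the weights and feature values below are arbitrary illustrations, not a trained model:

```python
def tower(inputs, weights):
    """A modality-specific 'tower': one linear unit with a ReLU,
    standing in for a full CNN or MLP branch."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return max(0.0, z)

def fuse_and_predict(seq_feats, atac_feats, rna_feats):
    """Encode each modality in its own tower, concatenate the embeddings,
    then apply a shared output layer on the fused representation."""
    embedding = [
        tower(seq_feats,  [0.4, -0.2, 0.1]),          # sequence branch
        tower(atac_feats, [0.3, 0.3]),                # accessibility branch
        tower(rna_feats,  [0.05, 0.05, 0.05, 0.05]),  # expression branch
    ]
    shared_weights = [1.0, 1.0, 1.0]                  # fusion/output layer
    return sum(w * e for w, e in zip(shared_weights, embedding))

score = fuse_and_predict([1.0, 0.5, 0.0], [0.8, 0.2], [1.0, 2.0, 3.0, 4.0])
```

Because each tower is trained jointly with the shared layer, the model can learn cross-modal interactions that no single branch sees on its own.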

A complementary strategy is joint representation learning, often implemented with autoencoders or similar models. Here, the goal is to learn a single, shared latent space where samples (e.g., patient tumors) are positioned based on their integrated biological similarity across all data types. Variational Autoencoders (VAEs) are frequently used for this, as their structured latent space facilitates downstream tasks. Once a unified representation is learned, it can be used for clustering to discover novel molecular subtypes, visualization, or as input to simpler models for prediction [24]. These integrative deep learning models offer significant advantages: they enhance predictive robustness by basing decisions on converging evidence from multiple data layers, improve accuracy for complex phenotypes, and, crucially, can yield novel biological insights by revealing which specific combinations of genomic and epigenomic features are most predictive, thereby highlighting key regulatory circuits and potential therapeutic targets.

5. Applications of deep learning in genomic and epigenomic research

5.1 Regulatory element prediction

A foundational and highly successful application of deep learning in genomics is the de novo identification and functional characterization of cis-regulatory elements (CREs). These non-coding DNA sequences, including promoters, enhancers, silencers, and insulators, control the precise spatial and temporal expression of genes. Traditional methods for CRE discovery relied heavily on evolutionary conservation or the presence of known transcription factor binding motifs, approaches that can miss lineage- or species-specific regulators and lack contextual precision. Deep learning models, particularly Convolutional Neural Networks (CNNs) and more recently Transformers, overcome these limitations. Trained on vast datasets pairing DNA sequence with experimental epigenomic annotations (e.g., ChIP-seq for histone modifications, ATAC-seq for accessibility), these models learn the complex, combinatorial sequence “grammar” that defines different classes of regulatory elements [25].

For instance, a model can be trained not only to identify an enhancer region but to classify its functional state—such as active (marked by H3K27ac), poised (marked by H3K4me1 alone), or repressed (marked by H3K27me3)—directly from sequence or a limited set of input features. This granular prediction dramatically improves the functional annotation of the vast non-coding genome. Furthermore, models like DeepSEA and ExPecto can predict the chromatin effects of sequence variants. By comparing the model’s predictions for reference and alternate alleles at a given genomic position, researchers can score the likely regulatory impact of non-coding single nucleotide polymorphisms (SNPs), providing a powerful tool for prioritizing disease-associated variants from genome-wide association studies (GWAS) that lie in regulatory regions, a task previously fraught with difficulty [26].
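The ref-versus-alt scoring idea behind in silico mutagenesis can be sketched with a deliberately simple "model", a motif-match scorer, standing in for a trained network; the motif and sequences are hypothetical:

```python
MOTIF = "TATA"   # hypothetical binding motif for this illustration

def best_match(seq):
    """Best motif-match score over all windows of the sequence
    (a toy surrogate for a trained model's regulatory score)."""
    return max(sum(a == b for a, b in zip(seq[i:i + len(MOTIF)], MOTIF))
               for i in range(len(seq) - len(MOTIF) + 1))

def variant_effect(ref_seq, alt_seq):
    """In silico mutagenesis: the difference between the model's scores
    for the alternate and reference alleles approximates the variant's
    regulatory impact."""
    return best_match(alt_seq) - best_match(ref_seq)

# A SNP that disrupts a perfect TATA box lowers the predicted score
delta = variant_effect("GGTATAGG", "GGTACAGG")
```

Models like DeepSEA apply exactly this comparison, but with a deep network scoring thousands of chromatin features for each allele.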

5.2 Chromatin state and epigenetic landscape modeling

The genome is packaged into chromatin, which exists in a finite number of recurrent, functionally distinct chromatin states defined by unique combinations of histone modifications, chromatin accessibility, and bound proteins. Manually annotating these states across the genome is complex. Deep learning models excel at this integrative classification task. By training on reference epigenomic maps (e.g., from the Roadmap Epigenomics or ENCODE projects), models can learn to predict a discrete chromatin state (e.g., “active promoter,” “weak enhancer,” “polycomb-repressed region”) for every position in the genome, often using only DNA sequence or a subset of experimental data as input [27].

Beyond static annotation, these models are powerful for dynamic modeling of epigenetic landscapes. By applying models to time-series or condition-specific epigenomic data (e.g., during cellular differentiation, in response to a drug, or in disease progression), researchers can identify key epigenetic transition points and the sequence features that predict them. This allows for the systematic identification of master regulatory regions whose epigenetic state change precedes and potentially drives large-scale transcriptional reprogramming, offering mechanistic insights into developmental processes and disease pathogenesis.

5.3 3D Genome organization

Gene regulation is a three-dimensional process, where distal enhancers must physically loop to contact their target promoters. Deep learning is revolutionizing the analysis of 3D genome architecture data from techniques like Hi-C. One approach uses models to predict chromatin interactions directly from one-dimensional sequence and epigenomic features. Models like DeepTACT and Orca demonstrate that features such as the binding motifs for architectural proteins (e.g., CTCF, cohesin), chromatin accessibility, and histone modifications contain sufficient information to predict the location of chromatin loops and topologically associating domain (TAD) boundaries with high accuracy [28].

A second, powerful approach employs Graph Neural Networks (GNNs) to analyze the Hi-C contact map itself as a graph. Here, nodes represent genomic loci, and edges represent interaction frequencies. GNNs can predict novel loops, classify boundary strength, and, most importantly, predict the structural consequences of genomic variants. For example, a model can be used to simulate how a deletion, duplication, or point mutation at a CTCF site might alter the local 3D interactome, thereby linking non-coding structural variants to dysregulated gene expression through a spatial mechanism—a connection virtually impossible to make with linear analysis alone [29]. This is crucial for understanding congenital disorders and cancer, where structural variants are common.

5.4 Disease genomics and biomarker discovery

Deep learning is accelerating the translation of genomic and epigenomic data into clinical insights. By performing integrative analysis on multi-omics profiles from cohorts of patients and healthy controls, deep learning models can identify subtle, multivariate patterns that elude conventional analysis. These models are adept at discovering novel disease subtypes with distinct molecular etiologies and prognoses, moving beyond histology-based classification. They can also prioritize non-coding driver elements in cancer, identifying enhancers that are somatically mutated, amplified, or epigenetically altered to drive oncogene expression, expanding the search for therapeutic targets beyond protein-coding genes [30].

Furthermore, deep learning enables the derivation of predictive and prognostic biomarkers from complex data. Instead of relying on single genes, models can learn robust multi-omics signatures—combinations of mutations, methylation patterns, chromatin accessibility shifts, and expression changes—that more accurately predict disease risk, progression, or survival outcome. For example, models integrating whole-genome sequencing, DNA methylation, and histopathology images have achieved high accuracy in classifying cancer types and predicting patient survival, demonstrating the superior power of integrated, deep learning-driven biomarkers [31].

5.5 Drug response and precision medicine

The ultimate promise of precision medicine is to match each patient with the therapy most likely to be effective for the molecular profile of their disease. Deep learning is key to realizing this vision. Integrative drug response prediction models take as input a patient’s multi-omics data—germline pharmacogenomic variants, tumor somatic mutation profile, epigenomic state (e.g., methylation silencing of a drug target), and transcriptomic signature—and predict sensitivity or resistance to a panel of drugs [32].

These models operate on two levels. First, they can be used for in-silico drug screening and drug repurposing. By inputting a disease-specific molecular signature (e.g., from a patient’s tumor), a model can rank thousands of compounds based on their predicted ability to reverse that signature towards a healthy state. Second, they facilitate patient stratification for clinical trials and treatment. By identifying the subset of patients whose tumors harbor the molecular features predicted to confer sensitivity to a targeted therapy, trials can be enriched for likely responders, increasing success rates and accelerating drug development. This approach moves beyond single biomarkers (e.g., EGFR mutation) to embrace the complexity of the tumor’s entire molecular circuitry [33].
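The signature-reversal idea behind in-silico screening can be sketched in a few lines as a similarity ranking over a toy compound library. The gene values and compound names below are invented for illustration; real pipelines score against large perturbation databases and use learned, not hand-built, signatures:

```python
import numpy as np

def reversal_score(disease_sig, compound_sig):
    """Cosine similarity between signatures; a strongly negative score
    means the compound pushes expression opposite to the disease state."""
    return (disease_sig @ compound_sig) / (
        np.linalg.norm(disease_sig) * np.linalg.norm(compound_sig))

# Toy disease signature over 6 genes (up in disease > 0, down < 0).
disease = np.array([2.0, -1.5, 1.0, 0.5, -2.0, 1.2])

# Hypothetical compound-induced expression changes.
library = {
    "cmpd_A": -disease + 0.1,                               # near-perfect reverser
    "cmpd_B": np.array([1.9, -1.4, 1.1, 0.4, -2.1, 1.0]),   # mimics the disease
    "cmpd_C": np.full(6, 0.01),                             # essentially inert
}

# Rank compounds from most-reversing (lowest score) to most-mimicking.
ranked = sorted(library, key=lambda name: reversal_score(disease, library[name]))
```

The top-ranked compound is the one whose induced expression change most nearly cancels the disease signature, which is exactly the prioritization a reversal-based screen produces at scale.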

6. Challenges and limitations

Despite its remarkable potential, the deployment of deep learning in genomics and epigenomics is not without significant obstacles. A primary constraint is the insatiable data hunger of deep models. State-of-the-art architectures, especially transformers, require massive, high-quality, and accurately labeled datasets for training to achieve generalizability. For many rare diseases, specific cellular states, or newly developed epigenomic assays, such datasets are simply unavailable, limiting model applicability [34].

Perhaps the most frequently cited challenge is the “black box” problem. The complex, multi-layered transformations within deep neural networks make it difficult to understand why a model arrived at a specific prediction. In biomedical research, where mechanistic insight is paramount, a prediction without explanation is of limited value. Did the model identify a true biological signal, or is it leveraging a confounding artifact in the data? The lack of interpretability and explainability hinders trust, clinical adoption, and the extraction of novel biological knowledge from successful models [35].

Practical and technical hurdles are also substantial. The computational cost of training large models on genome-scale data is prodigious, requiring access to expensive GPU/TPU clusters and generating a significant carbon footprint. Furthermore, biological data is plagued by technical heterogeneity—batch effects, platform differences, and lab-specific protocols. Deep models are notoriously adept at learning and exploiting these spurious technical signals as shortcuts, leading to impressive performance on held-out data from the same batch but poor generalization to data from new sources or populations [36].

Finally, there is a fundamental risk of models learning dataset-specific artifacts or spurious correlations rather than generalizable biological principles. Without rigorous validation across independent cohorts and experimental perturbation, there is a danger that deep learning may produce sophisticated but biologically meaningless predictions. Addressing these multifaceted challenges requires a concerted, interdisciplinary effort focused on developing Explainable AI (XAI) methods tailored for genomics, creating standardized benchmarks and large, curated public datasets, and advancing techniques for domain adaptation, transfer learning, and federated learning to build robust, privacy-preserving, and widely applicable models [37].

7. Future perspectives

The trajectory of deep learning in genomics points toward increasingly powerful, integrative, and biologically grounded frameworks. A major frontier is the development of multimodal foundation models. Inspired by large language models like GPT, these would be pre-trained on unprecedented scales of diverse data—terabases of genomic sequences, millions of epigenomic tracks, histopathology images, and scientific literature. Such a model would learn a universal representation of biomolecular function and regulation, serving as a versatile starting point that could be efficiently fine-tuned with minimal data for a vast array of downstream tasks, from variant interpretation to drug design [38].

To tackle data privacy and siloing, federated learning will become essential. This paradigm allows models to be trained across decentralized datasets (e.g., at multiple hospitals) without the raw data ever leaving its source, preserving patient confidentiality while leveraging larger, more diverse training populations [39]. The next wave of data integration will involve single-cell and spatial multi-omics. Deep learning models will be crucial to integrate measurements of DNA methylation, chromatin accessibility, and gene expression from the same single cell, and to map these states onto spatial transcriptomics data within a tissue. This will unravel cellular decision-making and tissue organization at an unprecedented resolution, revealing the epigenetic programs of rare cell types and the spatial niches that control them [40].
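The federated training loop itself is simple to sketch. Below is federated averaging on a toy linear-regression task in NumPy: the three "hospitals" and their data are simulated, and production systems layer secure aggregation and differential privacy on top of this basic pattern:

```python
import numpy as np

rng = np.random.default_rng(2)

def local_update(w, X, y, lr=0.1, steps=5):
    """A few local gradient steps on one site's private data
    (linear regression with mean-squared-error loss)."""
    w = w.copy()
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Three sites, each holding private data drawn from the same true model.
true_w = np.array([1.0, -2.0])
sites = []
for _ in range(3):
    X = rng.normal(size=(40, 2))
    y = X @ true_w + 0.01 * rng.normal(size=40)
    sites.append((X, y))

# Federated averaging: broadcast global weights, train locally, average.
w_global = np.zeros(2)
for _round in range(20):
    local_ws = [local_update(w_global, X, y) for X, y in sites]
    w_global = np.mean(local_ws, axis=0)  # only weights travel, never raw data
```

Each round, only model parameters cross institutional boundaries; the patient-level matrices `X` and `y` never leave their site, yet the averaged model converges toward the shared underlying relationship.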

Finally, generative AI is poised to open new avenues. Generative Adversarial Networks (GANs) and diffusion models could be used to simulate, in silico, the effects of genetic knockouts, epigenetic editing, or drug treatments, predicting the resulting gene expression changes or chromatin remodeling. They could also design synthetic regulatory elements with desired properties or generate realistic in silico patient cohorts to power rare-disease research. These capabilities will blur the line between computational prediction and wet-lab experimentation, enabling rapid, low-cost hypothesis generation and validation [41].

8. Conclusion

Deep learning has fundamentally altered the landscape of genomic and epigenomic research, transitioning from a promising novel tool to an indispensable engine of discovery. By providing the capacity to model the non-linear, hierarchical, and integrative nature of biological systems, it has delivered unprecedented accuracy in predicting regulatory elements, inferring 3D genome architecture, and uncovering the molecular basis of disease. The ability of deep learning frameworks to fuse heterogeneous, high-dimensional data types—from raw sequence to spatial interaction maps—has fostered a more comprehensive, systems-level understanding of genome regulation, moving the field beyond reductionist analysis [42].

The impact spans the entire research continuum. In basic science, deep learning models generate novel, testable biological hypotheses about regulatory logic and disease mechanisms. In translational and clinical contexts, they contribute directly to biomarker discovery, patient stratification, and the development of personalized therapeutic strategies, bringing the goals of precision medicine closer to reality [43]. However, this transformative potential is contingent upon overcoming persistent challenges. The critical issues of model interpretability, data quality and scarcity, computational demands, and robust generalization must be addressed through sustained methodological innovation [44-47].

The path forward requires a collaborative ethos. Advances in explainable AI (XAI), the creation of standardized benchmarks and shared resources, and the development of efficient, robust, and privacy-aware learning algorithms are essential. As these technical advances converge with the growing availability of rich, multi-modal biological data, deep learning is poised to solidify its role as the cornerstone of 21st-century computational biology [45]. It will not only accelerate the pace of discovery but also deepen our fundamental understanding of life’s code, ultimately guiding more precise diagnostics and effective therapies, and accelerating the journey toward truly predictive and personalized medicine.

Author Contributions: Conceptualisation, H.R.S.; software, S.F.; investigation, H.R.S.; writing—original draft preparation, H.R.S.; writing—review and editing, H.R.S.; visualisation, H.R.S.; supervision, H.R.S.; project administration, H.R.S. The author has read and agreed to the published version of the manuscript.

Funding: Not applicable.

Acknowledgments: The author is grateful to the Department of Plant Science, McGill University, Raymond Building, 21111 Lakeshore Road, Ste. Anne de Bellevue, Quebec, Canada, for providing the facilities to carry out this work.

Conflicts of Interest: The author declares no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.

Data Availability Statement: All the related data are supplied in this work or have been referenced properly.

References

  1. Allis CD, Jenuwein T. The Molecular Hallmarks of Epigenetic Control. Nature Reviews Genetics. 2016 Aug;17(8):487–500.
  2. Bannister AJ, Kouzarides T. Regulation of Chromatin by Histone Modifications. Cell Research. 2011 Mar;21(3):381–395.
  3. Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, et al. Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome. Science. 2009 Oct 9;326(5950):289–293.
  4. ENCODE Project Consortium. An Integrated Encyclopedia of DNA Elements in the Human Genome. Nature. 2012 Sep 6;489(7414):57–74.
  5. Goodwin S, McPherson JD, McCombie WR. Coming of Age: Ten Years of Next-Generation Sequencing Technologies. Nature Reviews Genetics. 2016 Jun;17(6):333–351.
  6. Shendure J, Balasubramanian S, Church GM, Gilbert W, Rogers J, Schloss JA, et al. DNA Sequencing at 40: Past, Present and Future. Nature. 2017 Oct 19;550(7676):345–353.
  7. Park PJ. ChIP-seq: Advantages and Challenges of a Maturing Technology. Nature Reviews Genetics. 2009 Oct;10(10):669–680.
  8. Buenrostro JD, Wu B, Chang HY, Greenleaf WJ. ATAC-seq: A Method for Assaying Chromatin Accessibility Genome-Wide. Current Protocols in Molecular Biology. 2015 Oct 5;109:21.29.1–21.29.9.
  9. Lister R, Pelizzola M, Dowen RH, Hawkins RD, Hon G, Tonti-Filippini J, et al. Human DNA Methylomes at Base Resolution Show Widespread Epigenomic Differences. Nature. 2009 Nov 19;462(7271):315–322.
  10. Lieberman-Aiden E, van Berkum NL, Williams L, Imakaev M, Ragoczy T, Telling A, et al. Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome. Science. 2009 Oct 9;326(5950):289–293.
  11. Zou J, Huss M, Abid A, Mohammadi P, Torkamani A, Telenti A. A Primer on Deep Learning in Genomics. Nature Genetics. 2019 Jan;51(1):12–18.
  12. Alipanahi B, Delong A, Weirauch MT, Frey BJ. Predicting the Sequence Specificities of DNA- and RNA-Binding Proteins by Deep Learning. Nature Biotechnology. 2015 Aug;33(8):831–838.
  13. Quang D, Xie X. DanQ: A Hybrid Convolutional and Recurrent Deep Neural Network for Quantifying the Function of DNA Sequences. Nucleic Acids Research. 2016 Jun 20;44(11):e107.
  14. Hochreiter S, Schmidhuber J. Long Short-Term Memory. Neural Computation. 1997 Nov 15;9(8):1735–1780.
  15. Singh S, Yang Y, Póczos B, Ma J. Predicting Enhancer-Promoter Interaction from Genomic Sequence with Deep Neural Networks. Quantitative Biology. 2019 Jun;7(2):122–137.
  16. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention Is All You Need. Advances in Neural Information Processing Systems. 2017;30:5998–6008.
  17. Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: Pre-trained Bidirectional Encoder Representations from Transformers Model for DNA-Language in Genome. Bioinformatics. 2021 Aug 4;37(15):2112–2120.
  18. Avsec Ž, Agarwal V, Visentin D, Ledsam JR, Grabska-Barwinska A, Taylor KR, et al. Effective Gene Expression Prediction from Sequence by Integrating Long-Range Interactions. Nature Methods. 2021 Oct;18(10):1196–1203.
  19. Hinton GE, Salakhutdinov RR. Reducing the Dimensionality of Data with Neural Networks. Science. 2006 Jul 28;313(5786):504–507.
  20. Kingma DP, Welling M. Auto-Encoding Variational Bayes. Proceedings of the 2nd International Conference on Learning Representations (ICLR). 2014.
  21. Scarselli F, Gori M, Tsoi AC, Hagenbuchner M, Monfardini G. The Graph Neural Network Model. IEEE Transactions on Neural Networks. 2009 Jan;20(1):61–80.
  22. Schwessinger R, Gosden M, Downes D, Brown RC, Oudelaar AM, Telenius J, et al. DeepC: Predicting 3D Genome Folding Using Neural Networks. Nature Methods. 2020 May;17(5):524–532.
  23. Stahlschmidt SR, Ulfenborg B, Synnergren J. Multimodal Deep Learning for Biomedical Data Fusion: A Review. Briefings in Bioinformatics. 2022 Mar;23(2):bbab569.
  24. Chaudhary K, Poirion OB, Lu L, Garmire LX. Deep Learning–Based Multi-Omics Integration Robustly Predicts Survival in Liver Cancer. Clinical Cancer Research. 2018 Mar 15;24(6):1248–1259.
  25. Kelley DR, Snoek J, Rinn JL. Sequential Regulatory Activity Prediction Across Chromosomes with Convolutional Neural Networks. Genome Research. 2018 May;28(5):739–750.
  26. Zhou J, Troyanskaya OG. Predicting Effects of Noncoding Variants with Deep Learning-Based Sequence Model. Nature Methods. 2015 Oct;12(10):931–934.
  27. Ernst J, Kellis M. Chromatin-State Discovery and Genome Annotation with ChromHMM. Nature Protocols. 2017 Dec;12(12):2478–2492.
  28. Fudenberg G, Imakaev M, Lu C, Goloborodko A, Abdennur N, Mirny LA. Emerging Evidence of Chromosome Folding by Loop Extrusion. Cold Spring Harbor Symposia on Quantitative Biology. 2017;82:45–55.
  29. Tang Z, Luo OJ, Li X, Zheng M, Zhu JJ, Szalaj P, et al. CTCF-Mediated Human 3D Genome Architecture Reveals Chromatin Topology for Transcription. Cell. 2015 Dec 17;163(7):1611–1627.
  30. Huang K, Xiao C, Glass LM, Sun J. Deep Learning in Genomics: A Review of Recent Advances and Future Prospects. arXiv preprint arXiv:2110.00927. 2021.
  31. Chaudhary K, Poirion OB, Lu L, Garmire LX. Deep Learning–Based Multi-Omics Integration Robustly Predicts Survival in Liver Cancer. Clinical Cancer Research. 2018 Mar 15;24(6):1248–1259.
  32. Sharifi-Noghabi H, Peng S, Zolotareva O, Collins CC, Ester M. MOLI: Multi-Omics Late Integration with Deep Neural Networks for Drug Response Prediction. Bioinformatics. 2019 Jul 15;35(14):i501–i509.
  33. Ma J, Yu MK, Fong S, Ono K, Sage E, Demchak B, et al. Using Deep Learning to Model the Hierarchical Structure and Function of a Cell. Nature Methods. 2018 Apr;15(4):290–298.
  34. Angermueller C, Pärnamaa T, Parts L, Stegle O. Deep Learning for Computational Biology. Molecular Systems Biology. 2016 Jul 29;12(7):878.
  35. Samek W, Binder A, Montavon G, Lapuschkin S, Müller KR. Evaluating the Visualization of What a Deep Neural Network Has Learned. IEEE Transactions on Neural Networks and Learning Systems. 2017 Nov;28(11):2660–2673.
  36. Hie B, Bryson B, Berger B. Efficient Integration of Heterogeneous Single-Cell Transcriptomes Using Scanorama. Nature Biotechnology. 2019 Jun;37(6):685–691.
  37. Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, Do BT, Way GP, et al. Opportunities and Obstacles for Deep Learning in Biology and Medicine. Journal of The Royal Society Interface. 2018 Apr;15(141):20170387.
  38. Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, et al. Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences. Proceedings of the National Academy of Sciences of the United States of America. 2021 Apr 13;118(15):e2016239118.
  39. Li T, Sahu AK, Talwalkar A, Smith V. Federated Learning: Challenges, Methods, and Future Directions. IEEE Signal Processing Magazine. 2020 May;37(3):50–60.
  40. Stuart T, Satija R. Integrative Single-Cell Analysis. Nature Reviews Genetics. 2019 May;20(5):257–272.
  41. Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, et al. Generative Adversarial Networks. Advances in Neural Information Processing Systems. 2014;27:2672–2680.
  42. Camacho DM, Collins KM, Powers RK, Costello JC, Collins JJ. Next-Generation Machine Learning for Biological Networks. Cell. 2018 Jun 14;173(7):1581–1592.
  43. Topol EJ. High-Performance Medicine: The Convergence of Human and Artificial Intelligence. Nature Medicine. 2019 Jan;25(1):44–56.
  44. Doshi-Velez F, Kim B. Towards a Rigorous Science of Interpretable Machine Learning. arXiv preprint arXiv:1702.08608. 2017.
  45. Regev A, Teichmann SA, Lander ES, Amit I, Benoist C, Birney E, et al. The Human Cell Atlas. eLife. 2017 Dec 5;6:e27041.
  46. Qahwaji R, Ashankyty I, Sannan NS, Hazzazi MS, Basabrain AA, Mobashir M. Pharmacogenomics: A Genetic Approach to Drug Development and Therapy. Pharmaceuticals. 2024;17:940.
  47. Mobashir M, Turunen SP, Izhari MA, Ashankyty IM, Helleday T, Lehti K. An Approach for Systems-Level Understanding of Prostate Cancer from High-Throughput Data Integration to Pathway Modeling and Simulation. Cells. 2022;11:4121.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of Global Journal of Basic Science and/or the editor(s). Global Journal of Basic Science and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content. 

Copyright: © 2025 by the authors. Submitted for possible open access publication under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
