Comput Struct Biotechnol J. 2024 Feb 1:23:843-858.
doi: 10.1016/j.csbj.2024.01.014. eCollection 2024 Dec.

Large language models assisted multi-effect variants mining on cerebral cavernous malformation familial whole genome sequencing

Yiqi Wang et al. Comput Struct Biotechnol J.

Abstract

Cerebral cavernous malformation (CCM) is a polygenic disease in which intricate genetic interactions across multiple factors contribute quantitatively to pathogenesis. The principal pathogenic genes of CCM, specifically KRIT1, CCM2, and PDCD10, have been reported, accompanied by a growing body of genetic data on their mutations. Numerous other molecules associated with CCM have also been identified. However, analyzing such massive volumes of unstructured data remained challenging until the advent of advanced large language models. In this study, we developed BRLM, an automated analytical pipeline specialized in analyzing biomedical text related to single nucleotide variants (SNVs). BioBERT was employed to vectorize the rich information on SNVs, while a deep residual network was used to discriminate the classes of the SNVs. BRLM was initially constructed on mutations from 12 different types of TCGA cancers, achieving an accuracy exceeding 99%. It was then applied to CCM mutations in familial sequencing data, highlighting an upstream master regulator gene, fibroblast growth factor 1 (FGF1). Multi-omics characterization and validation of biological function showed that FGF1 plays a significant role in the development of CCMs, which demonstrated the effectiveness of our model. The BRLM web server is available at http://1.117.230.196.

Keywords: Cerebral cavernous malformation; Deep learning; Large language model; Natural language processing; Whole genome sequencing.

Conflict of interest statement

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Figures

Graphical abstract
Fig. 1
BRLM model construction and performance evaluation. (A) BRLM structure, including BioBERT-encoded annotations of SNVs as 728-dimensional vectors visualized by UMAP (left), the ResNet50 model architecture for SNV classification (middle), and classification results presented by UMAP (right). (B) Distribution of TCGA pan-cancer classified variants in a Nightingale rose diagram. (C) Classification performance of four biomedical encoders in 12 TCGA cancers after 100 epochs. (D) Classification accuracy of BRLM per 10 epochs in 12 TCGA cancers. (E) Classification F1-score of BRLM per 10 epochs in 12 TCGA cancers. (F) Expression comparison between tumor and normal tissues as validation of the TCGA variant classification.
Fig. 2
BRLM mutation classification for a familial CCM WGS. (A) UMAP plot of the BRLM-classified CCM SNVs. (B) UMAP of the SNVs with SIFT scores attached. (C) UMAP of the SNVs annotated with ClinVar categories. (D) Statistics of functional variants in regulatory and exonic regions for the five classes. (E) Statistics of the distribution of potentially pathogenic variants within functional and exonic regions for Class 1, 2 and 3. (F) Circos plot showing the distribution of low-density functional areas, connected by PPI between high CCM risk variants in Class 1 and 2. (G) Circos plot showing the distribution of high-density functional areas, connected by PPI between uncertain CCM risk variants in Class 3.
Fig. 3
Enrichment results for mutated genes in Class 1, 2 and 3. (A) Similarity clustering heatmap for enriched pathways, with term frequencies shown by font size. (B) K-means clustering of the top 50 pathways with the most significant p-values. (C) The top 10 enriched pathways ranked by p-value.
Fig. 4
Pathway perturbation simulation derived from mutated genes in Class 1, 2 and 3. (A) Tree plot of the top 10 pathways with the highest PMAP scores, where the pie chart shows the proportion of involved genes from Class 1, 2 and Class 3. The fewer mutated genes in Class 1 and 2 play more important perturbation roles than those in Class 3. (B) Containment relationship between the top 10 perturbed pathways and the mutated genes with functional-domain variants. (C) Sankey plot of CCM risk-related elements at three levels: mutations, genes and pathways.
Fig. 5
FGF1 acts as the master regulator gene upstream of perturbed pathways, based on the integration of multi-omics results. (A) FGF1 is specifically expressed in the Astrocytes cluster in scRNA-seq. The markers for the “Astrocytes” cluster are colored in blue, while the marker for “Capillaries” is in red. (B) Distribution of FGF1 mutation sites in two mutant transcripts from WGS. (C) Differential expression profiles of multi-effect genes from RNA-seq. (D) The main effect gene FGF1, with multiple functional variants, lies upstream of peaked genes from perturbed and enriched pathways and interacts with mutated genes in Class 1, 2 and 3, including KRIT1, one of the three known CCM pathogenic genes. Multi-connection mutated genes in the same pathway are outlined with dashed lines.
Fig. 6
BRLM workflow for classifying variant annotations. Starting with wrangling of the annotated data, BioBERT constructs embedded vectors, which ResNet50 then classifies for distinct datasets.
Fig. 7
Structure of ResNet compared with classical convolutional neural networks, and the ResNet50 architecture diagram in BRLM. (A) A ResNet residual network with skip connections can mitigate the vanishing-gradient problem. (B) ResNet50 architecture constructed in BRLM for SNV classification.
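
The skip connection described for Fig. 7A can be illustrated with a minimal NumPy sketch. This is not the BRLM implementation: the fully connected layers, layer width, depth, weight scale, and function names below are assumptions chosen only to show the idea. It compares how much of the input signal survives a deep stack of plain blocks versus residual blocks of the form y = relu(F(x) + x); the identity path that preserves the forward signal is the same path that lets gradients flow backward.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def plain_block(x, w1, w2):
    """Two stacked linear layers with ReLU and no shortcut."""
    return relu(relu(x @ w1) @ w2)

def residual_block(x, w1, w2):
    """The same two layers plus an identity skip connection."""
    return relu(relu(x @ w1) @ w2 + x)

# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
dim, depth = 8, 20
x = np.abs(rng.normal(size=(4, dim)))  # non-negative toy input

# Small random weights: each plain block shrinks its input.
weights = [
    (0.01 * rng.normal(size=(dim, dim)), 0.01 * rng.normal(size=(dim, dim)))
    for _ in range(depth)
]

h_plain, h_res = x, x
for w1, w2 in weights:
    h_plain = plain_block(h_plain, w1, w2)
    h_res = residual_block(h_res, w1, w2)

x_norm = np.linalg.norm(x)
plain_norm = np.linalg.norm(h_plain)
res_norm = np.linalg.norm(h_res)
print("plain stack keeps fraction:   ", plain_norm / x_norm)
print("residual stack keeps fraction:", res_norm / x_norm)
```

With these small weights the plain stack drives the signal toward zero, while the residual stack keeps it close to the input magnitude because each block only adds a small correction to the identity path.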
