Predicting the clinical impact of human mutation with deep neural networks

Hong Gao and Kyle Farh; published April 21, 2021

Introduction

Millions of human genomes and exomes have been sequenced, but their clinical applications remain limited due to the difficulty of distinguishing disease-causing mutations from benign genetic variation^1,2. Because of their deleterious effects on fitness, clinically significant genetic variants tend to be extremely rare in the population³. Therefore, the observation of a variant at high frequencies in the population is strong evidence in favor of benign consequence^2,4, enabling pathogenic mutations to be systematically identified by process of elimination. Assaying common variation across diverse human populations is an effective strategy for cataloguing benign variants⁵, but the total amount of common variation in present day humans is limited. Out of more than 70 million potential missense variants in the reference genome, only roughly 1 in 1000 are present at greater than 0.1% overall population allele frequency^5,6.

Outside of modern human populations, chimpanzees are the next closest extant species, sharing 99.4% amino acid sequence identity with humans⁷. The near-identity of protein-coding sequence in humans and chimpanzees suggests that natural selection operating on chimpanzee protein-coding variants might also model the fitness consequences of identical mutations in human. If polymorphisms that are identical-by-state similarly affect fitness in the two species, the presence of a variant at high allele frequencies in chimpanzee populations should indicate benign consequence in human, substantially expanding the catalog of known benign variants. This is the hypothesis that we set out to test, beginning with chimpanzee variants.

We demonstrated that common primate variants tend to be benign in the human population. Using hundreds of thousands of common variants from population sequencing of six non-human primate species as training data, we developed PrimateAI, a deep neural network that predicts pathogenic mutations with high accuracy.

Common variants in other primates are largely benign in human

The recent availability of aggregated exome data, comprising 123,136 humans collected in the Exome Aggregation Consortium (ExAC) and Genome Aggregation Database (gnomAD), allows us to measure the impact of natural selection on missense and synonymous mutations across the allele frequency spectrum⁵. Singleton variants (observed only once in the cohort) closely match the expected 2.2:1 missense:synonymous ratio predicted by de novo mutation after adjusting for confounding factors (Fig. 1a)⁸, but at higher allele frequencies the number of observed missense variants decreases due to the purging of deleterious mutations by natural selection.

Figure 1. Missense:synonymous ratios across the human allele frequency spectrum.
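The comparison described above amounts to counting missense and synonymous variants within allele-frequency bins and taking their ratio. The following is a minimal sketch of that bookkeeping, not the analysis pipeline behind Fig. 1: the function name, the single cut point at 0.1%, and the toy counts are all assumptions chosen for illustration, and no adjustment for confounding factors is attempted.

```python
from bisect import bisect_right
from collections import defaultdict

def ms_ratio_by_bin(variants, edges):
    """Return {bin_index: missense/synonymous ratio}.

    variants: iterable of (allele_frequency, consequence) pairs, where
              consequence is "missense" or "synonymous".
    edges:    ascending allele-frequency cut points; bin i holds variants
              with edges[i-1] <= AF < edges[i].
    """
    counts = defaultdict(lambda: {"missense": 0, "synonymous": 0})
    for af, consequence in variants:
        counts[bisect_right(edges, af)][consequence] += 1
    ratios = {}
    for bin_index, c in sorted(counts.items()):
        # A bin without synonymous variants has no defined ratio.
        ratios[bin_index] = (c["missense"] / c["synonymous"]
                             if c["synonymous"] else float("nan"))
    return ratios

# Toy data (invented): rare variants near the 2.2:1 expectation,
# common variants depleted of missense changes.
toy_variants = (
    [(1e-5, "missense")] * 22 + [(1e-5, "synonymous")] * 10
    + [(1e-2, "missense")] * 11 + [(1e-2, "synonymous")] * 10
)
ratios = ms_ratio_by_bin(toy_variants, edges=[1e-3])
print(ratios)
```

A ratio that falls as allele frequency rises is the signature of purifying selection; a ratio that stays flat across bins is what the following paragraphs report for variants shared with other primates.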

Primate variants were obtained from the great ape genome sequencing project and dbSNP^9,10. We first examined common chimpanzee variants that are identical-by-state with human variants (Fig. 1b), and found that the missense:synonymous ratio is largely constant across the human allele frequency spectrum, which is consistent with the absence of negative selection against common chimpanzee variants in the human population. The low missense:synonymous ratio observed in human variants that are identical-by-state with common chimpanzee variants is consistent with the larger effective population size in chimpanzee, which enables more efficient filtering of mildly deleterious variation^11,12.

We next identified human variants that are identical-by-state with variation observed in at least one of six non-human primate species. Variation in each of the six species largely represents common variants, based on the limited number of individuals sequenced and the low missense:synonymous ratios observed for each species. Similar to chimpanzee, we find that the missense:synonymous ratios for variants from the six non-human primate species are roughly equal across the human allele frequency spectrum, other than a mild depletion of missense variation at common allele frequencies (Fig. 2), which is expected due to the inclusion of a minority of rare variants.

We find that human missense variants that are identical-by-state with observed primate variants are strongly enriched for benign consequence in the ClinVar database¹³. After excluding variants of uncertain significance and those with conflicting annotations, ClinVar variants that are present in at least one non-human primate species are annotated as Benign or Likely Benign on average 90% of the time, compared to 35% for ClinVar missense variants in general (Fig. 3). The pathogenicity of ClinVar annotations for primate variants is slightly greater than that observed from sampling a similarly sized cohort of healthy humans (~95% Benign or Likely Benign consequence).

A deep learning network for variant pathogenicity classification

The importance of variant classification for clinical applications has inspired numerous attempts to use supervised machine learning to address the problem, but these efforts have been hindered by the lack of an adequately sized truth dataset containing confidently labeled benign and pathogenic variants for training^14-24. Existing databases of expert-curated variants cover a small fraction of the genome, with ~50% of the variants in the ClinVar database coming from only 200 genes (~1% of human protein-coding genes). Moreover, systematic studies reveal that many human expert annotations have questionable supporting evidence^5,25, underscoring the difficulty of interpreting rare variants that may be observed in only a single patient. To reduce human interpretation biases, recent classifiers have been trained on common human polymorphisms or fixed human-chimpanzee substitutions^26-29, but these classifiers also take as input the prediction scores of earlier classifiers that were trained on human-curated databases. Objective benchmarking of the performance of these various methods has been elusive in the absence of an independent, bias-free truth dataset³⁰.

Variation from the six non-human primates (chimpanzee, bonobo, gorilla, orangutan, rhesus, and marmoset) contributes over 300,000 unique missense variants that do not overlap with common human variation. These largely represent common variants of benign consequence that have been through the sieve of purifying selection, greatly enlarging the training dataset available for machine learning approaches. On average, each primate species contributes the equivalent of 50,000 variants, more than the current total in the whole of the ClinVar database. Additionally, this content is free from biases in human interpretation.

Using a dataset consisting of common human variants and primate variation, we trained a novel deep residual network, PrimateAI (https://github.com/Illumina/PrimateAI), which takes as input the amino acid sequence flanking the variant of interest and the orthologous sequence alignments in other species (Fig. 4a)³¹. Unlike existing classifiers that employ human-engineered features, our deep learning network learns to extract features directly from primary sequence. To incorporate information about protein structure, we trained separate networks to predict secondary structure and solvent accessibility from sequence alone^32,33, and then included these as sub-networks in the full model (Fig. 4b). Given the small number of human proteins that have been successfully crystallized, inferring structure from primary sequence has the advantage of avoiding biases due to incomplete protein structure and functional domain annotation. The total depth of the network, with protein structure included, was 36 layers of convolutions, consisting of roughly 400,000 trainable parameters.

To train a classifier using only variants with benign labels, we framed the prediction problem as whether a given mutation is likely to be observed as a common variant in the population. Several factors influence the probability of observing a variant at high allele frequencies, of which we are interested only in deleteriousness. We matched each variant in the benign training set with an unlabeled missense mutation, controlling for the confounding factors, and trained the deep learning network to distinguish between benign variants and matched controls⁸. As the number of unlabeled variants greatly exceeds the size of the labeled benign training dataset, we trained eight networks in parallel, each using a different set of unlabeled variants matched to the benign training dataset, to obtain a consensus prediction.
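The matching and consensus steps can be illustrated with a short sketch. This is not the PrimateAI training code: the `stratum` key is a hypothetical stand-in for whatever confounders are being controlled, the "networks" are constant scoring functions rather than trained models, and averaging is assumed as the way to combine scores.

```python
import random
from collections import defaultdict
from statistics import mean

def sample_matched_controls(benign, unlabeled, rng):
    """Pair each benign variant with one unlabeled variant from the same stratum.

    'stratum' is a hypothetical key standing in for the confounders being
    controlled (for example a mutation-rate bin and a coverage bin).
    """
    pool = defaultdict(list)
    for variant in unlabeled:
        pool[variant["stratum"]].append(variant)
    for members in pool.values():
        rng.shuffle(members)
    # pop() draws without replacement; it raises IndexError if a stratum runs out.
    return [pool[variant["stratum"]].pop() for variant in benign]

def consensus_score(models, variant):
    """Average the scores of independently trained models (here: plain callables)."""
    return mean(model(variant) for model in models)

# Toy data (invented): two strata, far more unlabeled variants than benign ones.
benign = [{"id": f"b{i}", "stratum": i % 2} for i in range(4)]
unlabeled = [{"id": f"u{i}", "stratum": i % 2} for i in range(40)]

# Eight control sets, one per network, each drawn with a different seed.
control_sets = [sample_matched_controls(benign, unlabeled, random.Random(seed))
                for seed in range(8)]

# Stand-in "networks": constant scorers in place of trained models.
models = [lambda v, s=s: s for s in (0.2, 0.4, 0.6, 0.8)]
print(consensus_score(models, benign[0]))
```

Because each network sees a different draw of controls, disagreement between networks reflects sampling noise in the unlabeled set, which the consensus is meant to smooth out.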

Example of Pathogenicity Prediction

Using only primary amino acid sequence as its input, the deep learning network accurately assigns high pathogenicity scores to residues at critical protein functional domains, as shown for the voltage-gated sodium channel SCN2A (Fig. 5), a major disease gene in epilepsy, autism, and intellectual disability. The structure of SCN2A consists of four homologous repeats, each containing six transmembrane helices (S1-S6)^34,35. Upon membrane depolarization, the positively charged S4 transmembrane helix moves towards the extracellular side of the membrane, causing the S5/S6 pore-forming domains to open via the S4-S5 linker. Mutations in the S4, S4-S5 linker, and S5 domains, which are clinically associated with early onset epileptic encephalopathy³⁶, are predicted by the network to have the highest pathogenicity scores in the gene, and are depleted for variants in the healthy population.

We compared the performance of our network with existing classification algorithms, using 10,000 common primate variants that were withheld from training. Because ~50% of all newly arising human missense variants are filtered by purifying selection at common allele frequencies (Fig. 1a), we determined the 50th-percentile score for each classifier using randomly selected variants that were matched to the 10,000 common primate variants by mutational rate and sequencing coverage, and evaluated the accuracy of each classifier at that threshold (Fig. 6). Our deep learning network (91% accuracy) surpassed the performance of other classifiers (80% accuracy for the next best model) at assigning benign consequence to the 10,000 withheld common primate variants. Roughly half the improvement over existing methods comes from using the deep learning network, and half comes from augmenting the training dataset with primate variation, as compared to the accuracy of the network trained with human variation data only (Fig. 6).
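The thresholding scheme in this benchmark can be written down compactly. The sketch below assumes that a higher score means "more pathogenic" and that a variant scoring below the threshold is called benign; the function name and the toy scores are invented for illustration and do not come from Fig. 6.

```python
from statistics import median

def benign_accuracy_at_median(matched_random_scores, withheld_benign_scores):
    """Set the threshold at the 50th percentile of scores for matched random
    variants, then report the fraction of withheld benign variants that fall
    below it (that is, are called benign)."""
    threshold = median(matched_random_scores)
    benign_calls = sum(score < threshold for score in withheld_benign_scores)
    return threshold, benign_calls / len(withheld_benign_scores)

# Toy scores (invented).
matched_random = [0.1, 0.3, 0.5, 0.7, 0.9]
withheld_benign = [0.1, 0.2, 0.3, 0.4, 0.6, 0.2, 0.1, 0.45, 0.3, 0.7]

threshold, accuracy = benign_accuracy_at_median(matched_random, withheld_benign)
print(threshold, accuracy)
```

Fixing the threshold per classifier in this way puts methods with different score scales on a common footing, since each is asked the same question: what fraction of known-benign variants lands in its lower half?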

To test classification of variants of uncertain significance in a clinical scenario, we evaluated the ability of the deep learning network to distinguish between de novo mutations occurring in patients with neurodevelopmental disorders versus healthy controls. By prevalence, neurodevelopmental disorders constitute one of the largest categories of rare genetic diseases³⁷, and recent trio sequencing studies have highlighted the central role of de novo missense and protein-truncating mutations^38-41. We classified each confidently called de novo missense variant in 4,293 affected individuals from the Deciphering Developmental Disorders cohort (DDD)⁴², versus de novo missense variants from 2,517 unaffected siblings in the Simons Simplex Collection cohort (SSC)⁴³, and assessed the difference in prediction scores between the two distributions with the Wilcoxon rank-sum test (Fig. 7a). The deep learning network clearly outperforms other classifiers on this task (Fig. 7b).
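For readers unfamiliar with the Wilcoxon rank-sum test, the following is a minimal standard-library sketch using the normal approximation, with average ranks for ties and no tie correction to the variance. It is an illustration of the statistic only; the scores are invented, and in practice a library routine such as `scipy.stats.ranksums` would normally be used.

```python
from math import erfc, sqrt

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (z, p). A positive z means scores in x tend to rank higher than in y.
    """
    pooled = sorted((value, index) for index, value in enumerate(list(x) + list(y)))
    ranks = [0.0] * len(pooled)
    start = 0
    while start < len(pooled):
        end = start
        # Extend the window over a run of tied values.
        while end + 1 < len(pooled) and pooled[end + 1][0] == pooled[start][0]:
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for k in range(start, end + 1):
            ranks[pooled[k][1]] = average_rank
        start = end + 1
    n1, n2 = len(x), len(y)
    w = sum(ranks[:n1])                        # rank sum of the first sample
    mu = n1 * (n1 + n2 + 1) / 2                # expected rank sum under the null
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (w - mu) / sigma
    p = erfc(abs(z) / sqrt(2))                 # two-sided p-value
    return z, p

# Toy pathogenicity scores (invented): cases score higher than controls.
case_scores = [0.90, 0.80, 0.85, 0.70, 0.95]
control_scores = [0.10, 0.20, 0.30, 0.40, 0.50]
z, p = rank_sum_test(case_scores, control_scores)
print(z, p)
```

The test compares ranks rather than raw values, so it can be applied to classifiers whose scores are on different scales; a smaller p-value indicates a cleaner separation between the case and control score distributions.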

We next sought to estimate the accuracy of the deep learning network at classifying benign versus pathogenic mutations within the same gene. Given that the DDD population largely consists of index cases of affected children without affected first degree relatives, it is essential to show that the classifier has not inflated its accuracy by favoring pathogenicity in genes with de novo dominant modes of inheritance. We restricted the analysis to 605 genes that were nominally significant for disease association in the DDD study, calculated from protein-truncating variation only⁴². Within these genes, de novo missense mutations are enriched 3:1 compared to expectation (Fig. 8a), indicating that ~67% are pathogenic. The deep learning network was able to discriminate pathogenic and benign de novo variants within the same set of genes (Fig. 8b), outperforming other methods by a large margin (Fig. 8c).
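The step from a 3:1 enrichment to "~67% pathogenic" is a short piece of arithmetic: if three de novo missense mutations are observed for every one expected by chance, the two in excess are attributed to disease. A minimal sketch, with a function name chosen here for illustration:

```python
def excess_fraction(observed, expected):
    """Fraction of observed events in excess of the chance expectation."""
    return (observed - expected) / observed

pathogenic_fraction = excess_fraction(observed=3, expected=1)
print(round(pathogenic_fraction, 3))
```

This treats the expected count as the benign background and assumes that every mutation above that background is pathogenic.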

At a binary cutoff of ≥ 0.803 (Fig. 9a), 65% of de novo missense mutations in cases are classified by the deep learning network as pathogenic, compared to 14% of de novo missense mutations in controls, corresponding to a classification accuracy of 88% (Fig. 9b). Given frequent incomplete penetrance and variable expressivity in neurodevelopmental disorders⁴⁴, this figure likely underestimates the accuracy of our classifier due to the inclusion of partially penetrant pathogenic variants in controls.
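One way to see how the 65% and 14% rates relate to an accuracy near 88% is the back-of-envelope reconstruction below. It is an assumption-laden sketch and not necessarily the calculation behind Fig. 9b: it assumes that about two thirds of de novo missense mutations in cases are pathogenic (the 3:1 enrichment), that all control mutations are benign, that benign mutations in cases are flagged at the control rate, and that accuracy is the mean of sensitivity and specificity.

```python
def reconstruct_accuracy(case_rate, control_rate, pathogenic_fraction):
    """Infer sensitivity and specificity from the rates at which case and
    control mutations are called pathogenic (hypothetical helper).

    Model: case_rate = f * sensitivity + (1 - f) * control_rate,
    where f is the fraction of case mutations that are truly pathogenic.
    """
    specificity = 1 - control_rate
    sensitivity = (case_rate - (1 - pathogenic_fraction) * control_rate) / pathogenic_fraction
    return sensitivity, specificity, (sensitivity + specificity) / 2

sensitivity, specificity, accuracy = reconstruct_accuracy(
    case_rate=0.65, control_rate=0.14, pathogenic_fraction=2 / 3
)
print(round(sensitivity, 3), round(specificity, 3), round(accuracy, 3))
```

Under these assumptions the estimate lands close to the reported figure. Any pathogenic variants hiding among the controls would lower the apparent specificity, which is in line with the caveat that the reported accuracy is likely an underestimate.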

Our results suggest that systematic primate population sequencing is an effective strategy to classify the millions of human variants of uncertain significance that currently limit clinical genome interpretation. The accuracy of our deep learning network on both withheld common primate variants and clinical variants increases with the number of benign variants used to train the network. Cataloging common variation from additional primate species would improve interpretation for millions of variants of uncertain significance, further advancing the clinical utility of human genome sequencing.

Acknowledgements

We would like to thank J. K. Pritchard, M. E. Hurles, J. W. Belmont, and R. E. Green for insightful discussions. We would like to thank the Genome Aggregation Database (gnomAD) and the groups that provided exome and genome variant data to this resource. Yanjun Li and Xiaolin Li were partially supported by R01GM110240 from the National Institute of General Medical Sciences and by the National Science Foundation (grants CNS-1747783, CNS-1624782, and OAC-1229576). We would like to acknowledge the authors of the original paper, including Laksshman Sundaram, Samskruthi Reddy Padigepati, Jeremy F. McRae, Yanjun Li, Jack A. Kosmicki, Nondas Fritzilas, Jorg Hakenberg, Anindita Dutta, John Shon, Jinbo Xu, Serafim Batzoglou, and Xiaolin Li.

External links

Publication:

Software:

Primate polymorphisms from the great ape genome project:

And from dbSNP database:

PrimateAI scores of 70 million variants:

References

MacArthur, D. G. et al. Nature 508, 469-476, doi:10.1038/nature13127 (2014).
Rehm, H. L., J. S. Berg, L. D. Brooks, C. D. Bustamante, J. P. Evans, M. J. Landrum, D. H. Ledbetter, D. R. Maglott, C. L. Martin, R. L. Nussbaum, S. E. Plon, E. M. Ramos, S. T. Sherry, M. S. Watson. N. Engl. J. Med. 372, 2235-2242 (2015).
Bamshad, M. J., S. B. Ng, A. W. Bigham, H. K. Tabor, M. J. Emond, D. A. Nickerson, J. Shendure. Nat. Rev. Genet. 12, 745-755 (2011).
Richards, S. et al. Genet Med 17, 405-424, doi:10.1038/gim.2015.30 (2015).
Lek, M. et al. Nature 536, 285-291, doi:10.1038/nature19057 (2016).
Liu, X., X. Jian, E. Boerwinkle. Human Mutation 32, 894-899 (2011).
Chimpanzee Sequencing Analysis Consortium. Nature 437, 69-87, doi:10.1038/nature04072 (2005).
Samocha, K. E. et al. Nat Genet 46, 944-950, doi:10.1038/ng.3050 (2014).
Sherry, S. T. et al. Nucleic Acids Res 29, 308-311, doi:10.1093/nar/29.1.308 (2001).
Prado-Martinez, J. et al. Nature 499, 471-475 (2013).
Kimura, M. Cambridge University Press, 1983
de Manuel, M. et al. Science 354, 477-481, doi:10.1126/science.aag2602 (2016).
Landrum, M. J. et al. Nucleic Acids Res 44, D862-868, doi:10.1093/nar/gkv1222 (2016).
Ng, P. C. & Henikoff, S. Genome Res 11, 863-874, doi:10.1101/gr.176601 (2001).
Adzhubei, I. A. et al. Nat Methods 7, 248-249, doi:10.1038/nmeth0410-248 (2010).
Chun, S., J. C. Fay. Genome Research 19, 1553-1561 (2009).
Schwarz, J. M., C. Rödelsperger, M. Schuelke, D. Seelow. Nat. Methods 7, 575-576 (2010).
Reva, B., Antipin, Y. & Sander, C. Nucleic Acids Res 39, e118, doi:10.1093/nar/gkr407 (2011).
Dong, C. et al. Hum Mol Genet 24, 2125-2137, doi:10.1093/hmg/ddu733 (2015).
Carter, H., Douville, C., Stenson, P. D., Cooper, D. N. & Karchin, R. BMC Genomics 14 Suppl 3, S3, doi:10.1186/1471-2164-14-S3-S3 (2013).
Choi, Y., Sims, G. E., Murphy, S., Miller, J. R. & Chan, A. P. PLoS One 7, e46688, doi:10.1371/journal.pone.0046688 (2012).
Gulko, B., Hubisz, M. J., Gronau, I. & Siepel, A. Nat Genet 47, 276-283, doi:10.1038/ng.3196 (2015).
Shihab, H. A. et al. Bioinformatics 31, 1536-1543, doi:10.1093/bioinformatics/btv009 (2015).
Quang, D., Chen, Y. & Xie, X. Bioinformatics 31, 761-763, doi:10.1093/bioinformatics/btu703 (2015).
Bell, C. J., D. L. Dinwiddie, N. A. Miller, S. L. Hateley, E. E. Ganusova, J. Midge, R. J. Langley, L. Zhang, C. L. Lee, R. D. Schilkey, J. E. Woodward, H. E. Peckham, G. P. Schroth, R. W. Kim, S. F. Kingsmore. Sci. Transl. Med. 3, 65ra64 (2011).
Kircher, M., D. M. Witten, P. Jain, B. J. O'Roak, G. M. Cooper, J. Shendure. Nat. Genet. 46, 310-315 (2014).
Smedley, D. et al. Am J Hum Genet 99, 595-606, doi:10.1016/j.ajhg.2016.07.005 (2016).
Ioannidis, N. M. et al. Am J Hum Genet 99, 877-885, doi:10.1016/j.ajhg.2016.08.016 (2016).
Jagadeesh, K. A., A. M. Wenger, M. J. Berger, H. Guturu, P. D. Stenson, D. N. Cooper, J. A. Bernstein, G. Bejerano. Nature Genetics 48, 1581-1586 (2016).
Grimm, D. G. Human Mutation 36, 513-523 (2015).
He, K., X. Zhang, S. Ren, J. Sun. IEEE 770-778.
Heffernan, R. et al. Sci Rep 5, 11476, doi:10.1038/srep11476 (2015).
Wang, S., J. Peng, J. Ma, J. Xu. Scientific Reports 6, 18962-18962 (2016).
Payandeh, J., Scheuer, T., Zheng, N. & Catterall, W. A. The crystal structure of a voltage-gated sodium channel. https://www.nature.com/articles/nature10238
Shen, H. et al. Structure of a eukaryotic voltage-gated sodium channel at near-atomic resolution. https://science.sciencemag.org/content/355/6328/eaal4326
Nakamura, K. et al. Neurology 81, 992-998, doi:10.1212/WNL.0b013e3182a43e57 (2013).
Vissers, L. E., Gilissen, C. & Veltman, J. A. Nat Rev Genet 17, 9-18, doi:10.1038/nrg3999 (2016).
Neale, B. M. et al. Nature 485, 242-245, doi:10.1038/nature11011 (2012).
Sanders, S. J. et al. Nature 485, 237-241, doi:10.1038/nature10945 (2012).
De Rubeis, S. et al. Nature 515, 209-215, doi:10.1038/nature13772 (2014).
Deciphering Developmental Disorders Study. Nature 519, 223-228, doi:10.1038/nature14135 (2015).
Deciphering Developmental Disorders Study. Nature 542, 433-438, doi:10.1038/nature21062 (2017).
Iossifov, I. et al. Nature 515, 216-221, doi:10.1038/nature13908 (2014).
Zhu, X., Need, A. C., Petrovski, S. & Goldstein, D. B. Nat Neurosci 17, 773-781, doi:10.1038/nn.3713 (2014).
