Illumina Complete Long Reads software analysis workflow for human WGS

Kyria Roessler

Introduction

Next-generation sequencing (NGS) enables scientists to decipher the genome for a deeper understanding of biology. Proven Illumina sequencing by synthesis (SBS) chemistry combined with award-winning DRAGEN secondary analysis delivers whole-genome sequencing (WGS) data with outstanding accuracy.^1,2 DRAGEN Multigenome (graph) further improves mapping accuracy in challenging regions by ~50%.¹ Still, a small fraction of genic regions remains difficult to map with short reads alone and can benefit from the increased mappability of longer reads.

Illumina Complete Long Reads offers a streamlined workflow to make long-read sequencing accessible and help resolve these challenging regions of the human genome. With Illumina Complete Long Reads, short and long reads can be generated on a single platform. In combination with DRAGEN informatics and machine learning methods, Illumina Complete Long Reads extracts accurate variant calls and phasing information from NGS data. This article describes the fundamental principles behind Illumina Complete Long Reads human genome analysis.

How it works: Assay overview

The Illumina Complete Long Reads workflow (Figure 1) combines a proprietary library prep assay, proven Illumina SBS chemistry, and powerful DRAGEN secondary analysis to generate highly accurate long-read data with an N50 of 5–7 kb.
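For readers unfamiliar with the N50 metric quoted above, the following is a minimal sketch of how it is conventionally computed: the read length L such that reads of length L or longer contain at least half of all sequenced bases. The function name and the example lengths are hypothetical and are not taken from the workflow.

```python
def n50(lengths):
    """Return the N50 of a collection of read lengths (in bases)."""
    if not lengths:
        raise ValueError("need at least one read length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Walk from the longest read down until half of all bases are covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical read lengths in bases, for illustration only.
example_lengths = [2000, 3000, 5000, 6000, 7000, 9000]
print(n50(example_lengths))
```

With these example lengths, the two longest reads (9000 and 7000 bases) already cover half of the 32,000 total bases, so the N50 is 7000.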

Library prep for Illumina Complete Long Reads

The efficient, single-day library preparation protocol is easy to scale for high-throughput studies and requires only 50 ng DNA input.^* The assay uses tagmentation to make long genomic DNA fragments (> 10 kb), eliminating the need for additional shearing or size selection. Long, single-molecule DNA fragments are enzymatically marked with unique patterns of single base pair changes. These “land-marks” are introduced at low (4%–7%) frequency along the length of the DNA fragment. Each single-molecule fragment has a unique signature of land-marks to capture and preserve long-read information (without the use of complex barcodes or adapters). Land-marked long fragments are amplified, followed by a second tagmentation step to prepare the libraries for standard sequencing on Illumina systems.

_{* 50 ng DNA input is recommended; as low as 10 ng DNA input is possible.}

Bioinformatics workflow

The analysis pipeline generates long reads and combines the data with a standard, unmarked WGS library^† to produce long contiguous reads that are complete and accurate representations of the original single-molecule fragments.

_{† Requires 30× standard short-read human whole-genome data from the same sample for analysis. Illumina DNA PCR-Free Prep is recommended. Third-party WGS kits are also compatible. The unmarked library does not need to be prepared or sequenced in parallel; FASTQ files from a previously run sample can be used.}

Figure 1: How the Illumina Complete Long Reads assay works

Illumina Complete Long Reads generation

The Illumina Complete Long Reads bioinformatics workflow for long-read generation includes standard genomic computational methods like alignment and variant calling. The workflow is packaged and available as a push-button app in BaseSpace Sequence Hub. The workflow uses land-marked and unmarked libraries and a reference genome as inputs. These inputs are then used to carry out a series of steps (Figure 2) to generate long reads from single molecules for comprehensive WGS analysis.

Identify land-marked sites on reads

The first step in the long-read generation process is to identify the marks present in the land-marked library. In confident-to-map regions, most land-marks can be identified by standard alignment and detection of nucleotides that differ from the reference genome.
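The alignment-based case can be pictured with a very small sketch. It assumes a gapless alignment at a known reference offset and simply reports mismatching positions; the function name and the sequences are hypothetical. A mismatch found this way is only a candidate, since it could also be a true variant or a sequencing error, which is why the workflow later compares against unmarked reads.

```python
def candidate_marks(reference, read, start):
    """Return reference positions where a gaplessly aligned read differs.

    Assumes the read aligns without indels starting at `start` (0-based).
    """
    window = reference[start:start + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    return [start + i for i, (ref_base, read_base) in enumerate(zip(window, read))
            if ref_base != read_base]


# Hypothetical sequences, for illustration only.
reference = "ACGTACGTACGTACGT"
read = "TACATACG"  # aligned at reference position 3; differs at one base
print(candidate_marks(reference, read, 3))
```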

For reads that come from regions that do not readily align to the reference genome (eg, repetitive regions), a different approach is needed to detect land-marks. Specific methods for k-mers (ie, informatically breaking up reads into small strings of nucleotides of “k” length) allow algorithms to determine relationships between reads without use of a reference genome. In difficult-to-map regions, marks are inferred by comparing k-mers from the land-marked and unmarked reads.³ If a k-mer in a marked read cannot be paired with any k-mer from the unmarked reads, it will be treated as a land-mark.
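The reference-free comparison can be sketched as a set lookup: index every k-mer seen in the unmarked reads, then flag k-mers of a marked read that are missing from that index. This is a simplified illustration, not the production method; the function names, the value of k, and the sequences are hypothetical, and reverse complements and sequencing errors are ignored.

```python
def kmers(seq, k):
    """Return every length-k substring of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def unmatched_kmers(marked_read, unmarked_reads, k):
    """Return (offset, k-mer) pairs from marked_read absent from all unmarked reads."""
    unmarked_index = set()
    for read in unmarked_reads:
        unmarked_index.update(kmers(read, k))
    return [(i, km) for i, km in enumerate(kmers(marked_read, k))
            if km not in unmarked_index]


# Hypothetical reads: the marked read differs from the unmarked ones at one base.
unmarked = ["ACGTTGCA", "GTTGCATC"]
marked = "ACGTAGCA"
hits = unmatched_kmers(marked, unmarked, k=4)
print(hits)
```

Note that a single changed base makes every k-mer overlapping it unmatched, so a real implementation would still need to localize the mark within the run of flagged k-mers.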

Build weighted network of land-marked reads

After detecting all the land-marks in the reads, the next step is to identify connections among reads based on their shared marks. We use minimizer k-mers to index pairs of reads that are similar and optimize k-mer matching.⁴ All pairs that share a given minimizer k-mer can be compared in detail. The number of shared and conflicting land-marks determines the strength of evidence connecting reads (Figure 3). We build a weighted network of marked reads based on the strength of those connections.

Figure 3: Illustration of the process used to identify reads with shared land-marks
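A minimal sketch of this step follows, under stated assumptions. Minimizers are taken as the lexicographically smallest k-mer in each window of w consecutive k-mers, which is one common definition. Each read is a hypothetical record with a sequence, a half-open interval in a shared coordinate system, and a set of land-mark positions. The edge weight is simply shared marks minus conflicting marks within the overlap; the scoring used by the actual workflow is not described in this article.

```python
from collections import defaultdict
from itertools import combinations


def minimizers(seq, k, w):
    """Set of minimizer k-mers: the smallest k-mer in each window of w k-mers."""
    kms = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if len(kms) < w:
        return {min(kms)} if kms else set()
    return {min(kms[i:i + w]) for i in range(len(kms) - w + 1)}


def candidate_pairs(seqs, k, w):
    """Pairs of read names whose sequences share at least one minimizer."""
    index = defaultdict(set)
    for name, seq in seqs.items():
        for m in minimizers(seq, k, w):
            index[m].add(name)
    pairs = set()
    for names in index.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


def edge_weight(read_a, read_b):
    """Shared minus conflicting land-marks inside the overlap of two reads."""
    lo = max(read_a["start"], read_b["start"])
    hi = min(read_a["end"], read_b["end"])  # end is exclusive
    a = {p for p in read_a["marks"] if lo <= p < hi}
    b = {p for p in read_b["marks"] if lo <= p < hi}
    return len(a & b) - len(a ^ b)


def weighted_network(reads, k, w):
    """Map each candidate read pair to its edge weight."""
    seqs = {name: r["seq"] for name, r in reads.items()}
    return {pair: edge_weight(reads[pair[0]], reads[pair[1]])
            for pair in candidate_pairs(seqs, k, w)}


# Hypothetical reads, for illustration only.
reads = {
    "r1": {"seq": "ACGTACGT", "start": 0, "end": 8, "marks": {4, 6}},
    "r2": {"seq": "TACGTTTT", "start": 3, "end": 11, "marks": {4, 6, 9}},
    "r3": {"seq": "GGGGCCCC", "start": 20, "end": 28, "marks": {22}},
}
print(weighted_network(reads, k=3, w=2))
```

Indexing by minimizer keeps the expensive pairwise comparison limited to reads that are already likely to overlap, rather than comparing all pairs.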

Find groups of reads from the same template

The connections between reads form a graph of all reads. A series of decomposition and clustering methods is applied (such as removing conflicting or weak connections due to a low number of shared land-marks) to split the full network into strongly linked clusters (Figure 4). Each cluster is presumed to originate from a single molecule.

Figure 4: Illustration of the clustering process
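One simple way to picture the decomposition is to drop edges below a weight threshold and then take connected components of what remains. This is only a sketch of the general idea; the actual workflow applies a series of decomposition and clustering methods that are not detailed here, and the threshold, function name, and edge weights below are hypothetical.

```python
def cluster_reads(nodes, weighted_edges, min_weight):
    """Group reads into connected components after removing weak edges."""
    adjacency = {n: set() for n in nodes}
    for (a, b), weight in weighted_edges.items():
        if weight >= min_weight:  # weak or conflicting links are discarded
            adjacency[a].add(b)
            adjacency[b].add(a)

    clusters, seen = [], set()
    for node in sorted(nodes):
        if node in seen:
            continue
        # Depth-first traversal collects one component.
        stack, component = [node], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Hypothetical network: the r3-r4 link has more conflicts than shared marks.
nodes = ["r1", "r2", "r3", "r4", "r5"]
edges = {("r1", "r2"): 5, ("r2", "r3"): 4, ("r3", "r4"): -2, ("r4", "r5"): 6}
print(cluster_reads(nodes, edges, min_weight=3))
```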

Assemble each group of land-marked reads

From the final clusters, DRAGEN analysis uses k-mer–based, de Bruijn graph–like assembly methods to generate long-read contigs (Figure 5).

Figure 5: Illustration of the assembly process
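To illustrate the general idea of de Bruijn graph assembly (not the DRAGEN implementation), the sketch below builds a graph whose nodes are (k-1)-mers and whose edges are k-mers observed in the reads of one cluster, then walks the single unambiguous path. It assumes error-free reads, no repeats longer than k-1 bases, and a single linear contig; the function name, k, and the reads are hypothetical.

```python
from collections import defaultdict


def assemble(reads, k):
    """Assemble one contig by walking an unbranched de Bruijn path."""
    successors = defaultdict(set)
    indegree = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            # Each k-mer links its (k-1)-base prefix to its (k-1)-base suffix.
            left, right = read[i:i + k - 1], read[i + 1:i + k]
            if right not in successors[left]:
                successors[left].add(right)
                indegree[right] += 1

    nodes = set(successors) | set(indegree)
    starts = [n for n in nodes if indegree.get(n, 0) == 0]
    if len(starts) != 1:
        raise ValueError("expected exactly one start node in this simplified sketch")

    node = starts[0]
    contig = node
    visited = {node}
    # Extend while the path is unambiguous and does not loop back.
    while len(successors.get(node, ())) == 1:
        nxt = next(iter(successors[node]))
        if nxt in visited or indegree[nxt] != 1:
            break
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig


# Hypothetical overlapping reads from one cluster.
cluster = ["ACGTTGCA", "TTGCATCG", "CATCGGA"]
print(assemble(cluster, k=4))
```

The walk stops at the first branch or cycle, so repeats and errors would fragment the contig here; a production assembler needs additional logic to resolve them.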

Remove land-marks from long reads

After land-marks are used to support generation of long reads, the marks can be removed. To distinguish land-marks from true variants, land-marked long reads are compared to unmarked reads. Any land-marks that do not match the corresponding unmarked read are updated so that the final Illumina Complete Long Read reveals the true sequence (Figure 6). The comparison between land-marked long reads and unmarked reads is similar to how land-marks are identified: it is performed in part using reference genome alignment and in part using k-mer indexing, especially in regions that are challenging to map. After obtaining an alignment of unmarked reads to land-marked long reads, a Bayesian model is applied to determine the final base calls of the long read and the corresponding quality scores.
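The article does not specify the Bayesian model, so the following is a generic textbook sketch rather than the DRAGEN method. It assumes a uniform prior over the four bases, independent observations from the aligned unmarked reads, and errors spread evenly over the three wrong bases. The posterior of the best base gives the call, and the posterior error probability is converted to a Phred-scaled quality. The function name and the cap of 60 are hypothetical.

```python
import math

BASES = "ACGT"


def call_base(observations, max_quality=60.0):
    """Return (base, phred_quality) from (observed_base, error_probability) pairs.

    Each error probability must lie strictly between 0 and 1.
    """
    log_posterior = {}
    for candidate in BASES:
        total = math.log(0.25)  # uniform prior (assumption)
        for base, err in observations:
            total += math.log(1 - err if base == candidate else err / 3)
        log_posterior[candidate] = total

    # Normalize in a numerically stable way.
    top = max(log_posterior.values())
    unnormalized = {b: math.exp(v - top) for b, v in log_posterior.items()}
    norm = sum(unnormalized.values())
    posterior = {b: v / norm for b, v in unnormalized.items()}

    best = max(posterior, key=posterior.get)
    p_error = sum(p for b, p in posterior.items() if b != best)
    if p_error <= 0:
        return best, max_quality
    return best, min(max_quality, -10 * math.log10(p_error))


# Hypothetical pileup at one position of a long read.
base, quality = call_base([("A", 0.01), ("A", 0.01), ("G", 0.01)])
print(base, round(quality, 1))
```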

Secondary analysis

After the Illumina Complete Long Read construction steps described above, Illumina Complete Long Reads and the unmarked short reads are used for secondary analysis (Figure 7). Complete long reads are first aligned to the genome using a modified version of Minimap2.

For small variant calling, results from DRAGEN small variant calling of long reads and short reads are merged into a single VCF file. DRAGEN small variant calling is capable of processing reads longer than 75 kb. A machine learning model (trained on variant calls from Genome in a Bottle) is used to combine and improve small variant calls obtained from long reads and standard short reads. Finally, a modified version of WhatsHap is used for phasing Illumina Complete Long Reads and merged small variants with new, comprehensive output files created to capture the haplotype information.

For structural variant calling, the output of the long-read structural variant caller (Sniffles2)⁵ and the output of the short-read DRAGEN structural variant caller are merged into a single VCF file.

Figure 7: Illumina Complete Long Reads secondary analysis

Highly accurate WGS

Illumina Complete Long Read technology takes advantage of proven Illumina SBS chemistry and DRAGEN secondary analysis to further improve accuracy for human WGS. With PrecisionFDA Truth Challenge v2 data sets, the F1 score reflecting precision and recall for WGS using the Illumina Complete Long Read assay was 99.87% (Figure 8).^6,7 Compared with standard WGS, Illumina Complete Long Read data demonstrate an overall reduction in false negatives and false positives in both SNPs and indels across multiple benchmark samples (Figure 9).

Figure 8: Highest accuracy with Illumina Complete Long Reads

Figure 9: Illumina Complete Long Read assay performs highly accurate variant calling for challenging genic regions
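For reference, the F1 score is the harmonic mean of precision and recall, both of which are derived from counts of true positives, false positives, and false negatives against a truth set. The sketch below shows the standard calculation; the counts are hypothetical and are not the benchmark data behind the figures.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Return the F1 score from variant-call counts against a truth set."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)  # fraction of calls that are correct
    recall = true_pos / (true_pos + false_neg)     # fraction of true variants found
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only.
print(f1_score(true_pos=90, false_pos=10, false_neg=10))
```

Because both false positives and false negatives lower the score, a reduction in either kind of error raises F1.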

Conclusion

Long-read information can help resolve the most challenging regions of the genome. Illumina Complete Long Reads makes comprehensive WGS easily accessible for genomics labs by enabling both long and short reads on the same instrument. Illumina Complete Long Reads offers advantages such as a streamlined, familiar lab workflow, minimal input requirements, large-scale library kit manufacturing, and contiguous reads for producing high-quality and comprehensive variant calling across genic regions.

Learn more

Read how using Illumina Complete Long Reads increases accuracy for small variant calling: Comprehensive whole-genome sequencing with Illumina Complete Long Read Prep, Human technical note

Illumina Complete Long Read Prep, Human data sheet

Illumina Complete Long Reads technology

References

1. Mehio R, Ruehle M, Catreux S, et al. DRAGEN Wins at PrecisionFDA Truth Challenge V2 Showcase Accuracy Gains from Alt-aware Mapping and Graph Reference Genomes. Accessed May 16, 2023.
2. Illumina. Accuracy improvements in germline small variant calling with the DRAGEN Bio-IT Platform. Accessed May 16, 2023.
3. Leinonen M, Salmela L. IEEE/ACM Trans Comput Biol Bioinform. 2022;19(6):3444-3455. doi:10.1109/TCBB.2021.3113131
4. Roberts M, Hayes W, Hunt BR, Mount SM, Yorke JA. Bioinformatics. 2004;20(18):3363-3369. doi:10.1093/bioinformatics/bth408
5. Sedlazeck FJ, Rescheneder P, Smolka M, et al. Nat Methods. 2018;15(6):461-468. doi:10.1038/s41592-018-0001-7
6. Illumina. Data on file. 2022.
7. PrecisionFDA. Truth Challenge V2: Calling Variants from Short and Long Reads in Difficult-to-Map Regions. Accessed January 12, 2023.

