How Dash transforms Next Generation Sequencing Quality Control

Plotly

Published in Plotly · 6 min read · Oct 27, 2021

📌 Learn more about Dash, Dash Bio, and Alignment Chart ➡️ watch the recorded webinar!

Next-Generation Sequencing

Every cell in your body contains DNA that carries genetic information. It is the blueprint for how organisms grow, reproduce, and function. Next-Generation Sequencing reads the order of the nucleotides in DNA to determine its base sequence. By identifying genes and comparing genetic information across individuals or species, you can:

Research personal genomics and health
Identify novel pathogens
Diagnose diseases
Perform phylogenetic analysis
Study pharmacogenetics

Next-Generation Sequencing (NGS) systems were introduced in the past decade and made parallel sequencing possible: they can analyze millions of sequencing reactions at the same time. Although there are many different machines on the market, each with its own sequencing method, they all follow the same 3 basic steps:

Library prep
Sequencing
Data Analysis

Library Prep: A library is a collection of similarly sized DNA fragments with custom adapter sequences. An adapter sequence is a short piece of DNA that allows the fragments of DNA to be identified following sequencing. A library can be obtained by amplification or ligation of DNA with custom adapter sequences.

Sequencing: Each library adapter is hybridized with covalently attached DNA linkers to amplify each library fragment. Amplification creates clusters of DNA that each originate from a single library fragment. Each cluster acts as an individual sequencing reaction. The main differences between NGS platforms lie in the technical details of the sequencing reactions, which can be grouped into four categories: pyrosequencing, sequencing by synthesis, sequencing by ligation, and ion semiconductor sequencing.

Data Analysis: Each machine provides raw data at the end of the sequencing run: the collection of DNA sequences generated at each cluster. This output is the starting point for data analysis.

Even though current NGS platforms are very accurate, they are still prone to error. Even at accuracies greater than 99%, a generated sequence may contain incorrect nucleotides. For example, at 99% accuracy, on average one base in every 100 is misread. Because NGS platforms generate large amounts of output (the human genome alone is about 3.2 billion base pairs long), these errors add up quickly. One way to catch and reduce them is quality control.
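To make that arithmetic concrete, here is a minimal Python sketch. It assumes errors are independent and uniformly distributed across bases, which real instruments do not strictly satisfy, and it uses the standard Phred relation Q = -10 · log10(p) to express the same error rate as a quality score. The function names are illustrative only.

```python
import math

HUMAN_GENOME_BP = 3_200_000_000  # approximate length quoted in the text


def expected_errors(accuracy: float, bases: int) -> float:
    """Expected number of miscalled bases, assuming independent errors."""
    return (1.0 - accuracy) * bases


def phred_quality(error_probability: float) -> float:
    """Phred quality score for a given per-base error probability."""
    return -10.0 * math.log10(error_probability)


for accuracy in (0.99, 0.999, 0.9999):
    errors = expected_errors(accuracy, HUMAN_GENOME_BP)
    q = phred_quality(1.0 - accuracy)
    print(f"accuracy={accuracy:.2%}  Q{q:.0f}  expected errors ~ {errors:,.0f}")
```

At 99% accuracy, a single pass over 3.2 billion bases would be expected to contain about 32 million miscalled bases under these assumptions.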

NGS Quality Control

Quality Control (QC) can identify issues before data analysis begins, ensure accurate results, and improve the interpretation of those results. Before sequencing, quality is assessed through DNA quantification, cluster density/intensity, focus scores, and index splitting. After sequencing, three levels of quality control can occur:

Non-Alignment based metrics
Alignment against a reference genome
Assessment after bioinformatic analysis

Level 1: Looks at unprocessed raw data. It is observed directly from the instrument and is an absolute measurement. The input is a FASTQ file from the sequencer. You assess the quality of the reads themselves: sequence count, per-sequence quality score, per-base sequence content, etc…
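As a rough illustration of Level 1 metrics, the sketch below reads a tiny in-memory FASTQ example and computes sequence count, GC fraction, and mean quality per read. It assumes the common four-lines-per-record layout and Phred+33 quality encoding; the example reads and all function names are made up for illustration, and this is not how any particular QC tool implements these metrics.

```python
import io
from statistics import mean


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().strip()
        if not header:
            return  # end of file (or a blank line) stops the iteration
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().strip()
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Decode a quality string, assuming Phred+33 encoding."""
    return [ord(ch) - offset for ch in quality]


def summarize(handle):
    """Collect a few non-alignment QC metrics for all reads in a file."""
    n_reads = total_bases = gc_bases = 0
    mean_qualities = []
    for _, seq, qual in read_fastq(handle):
        n_reads += 1
        total_bases += len(seq)
        gc_bases += sum(seq.upper().count(base) for base in "GC")
        mean_qualities.append(mean(phred_scores(qual)))
    return {
        "sequence_count": n_reads,
        "total_bases": total_bases,
        "gc_fraction": gc_bases / total_bases if total_bases else 0.0,
        "mean_read_quality": mean_qualities,
    }


# Two made-up reads, for illustration only.
EXAMPLE_FASTQ = """@read1
GATTACA
+
IIIIIII
@read2
GGCCAATT
+
!!!!IIII
"""

summary = summarize(io.StringIO(EXAMPLE_FASTQ))
print(summary)
```

A dictionary like this, collected per sample, is the kind of table a dashboard can then plot or compare across runs.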

Level 2: Looks at quality-controlled data. The data is associated with metadata and compared with calibrations. An aligner maps the reads against a reference genome. You observe mapping rate, duplication rate, etc…
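For a sense of how Level 2 metrics can be derived, the sketch below computes mapping rate and duplication rate from the FLAG field of SAM alignment records. The bit values come from the SAM specification. The exact definitions are assumptions here: secondary and supplementary records are skipped, and the duplication rate is taken as flagged duplicates over mapped reads. Duplicates only show up if a duplicate-marking tool has already set the flag. The records are made up, and `make_record` is a hypothetical helper.

```python
# FLAG bits defined in the SAM specification.
UNMAPPED = 0x4
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def alignment_rates(sam_lines):
    """Compute mapping and duplication rates from SAM text lines."""
    total = mapped = duplicates = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        flag = int(line.split("\t")[1])
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read once, via its primary record
        total += 1
        if not flag & UNMAPPED:
            mapped += 1
            if flag & DUPLICATE:
                duplicates += 1
    return {
        "mapping_rate": mapped / total if total else 0.0,
        "duplication_rate": duplicates / mapped if mapped else 0.0,
    }


def make_record(name, flag):
    """Hypothetical helper: one SAM line with placeholder field values."""
    fields = [name, str(flag), "chr1", "100", "60", "7M",
              "*", "0", "0", "GATTACA", "IIIIIII"]
    return "\t".join(fields)


example_sam = [
    "@HD\tVN:1.6",
    make_record("read1", 0),     # mapped, forward strand
    make_record("read2", 16),    # mapped, reverse strand
    make_record("read3", 1024),  # mapped, flagged as duplicate
    make_record("read4", 4),     # unmapped
]

rates = alignment_rates(example_sam)
print(rates)
```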

Level 3: Derives products that require scientific and technical interpretation. These standards are defined by the community that collects and utilizes the data. You observe assembly, expression levels, variant calling, etc…

With QC reports, you can detect problems with laboratory processes (uneven pooling, high ribosomal RNA content), improve protocols, compare results to previous experiments, and reduce cost. QC prevents bad data from being incorporated into the analysis. It helps identify outliers and samples with known issues that may affect analysis results, and it helps explain observations in the data when publishing results. It can also be used for trend analysis: by looking at results over time, QC can provide a baseline by experiment type for comparison and identify areas for optimization in the lab and in bioinformatic pipelines.
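One simple way to picture trend analysis is to build a baseline from past runs of the same experiment type and flag new runs that fall far from it. The sketch below uses a z-score against the historical mean and standard deviation. The metric values, run names, and the threshold of 3 are all assumptions for illustration; a real lab would choose its own metrics and cutoffs.

```python
from statistics import mean, stdev


def flag_outliers(history, new_runs, z_threshold=3.0):
    """Return the runs whose metric deviates strongly from the baseline.

    history: metric values from past runs (the baseline).
    new_runs: mapping of run id to metric value for the runs to check.
    """
    baseline_mean = mean(history)
    baseline_sd = stdev(history)
    flagged = {}
    for run_id, value in new_runs.items():
        z = (value - baseline_mean) / baseline_sd if baseline_sd else 0.0
        if abs(z) > z_threshold:
            flagged[run_id] = round(z, 2)
    return flagged


# Made-up mapping rates from earlier runs of one experiment type.
history = [0.95, 0.96, 0.94, 0.95, 0.96, 0.95]
new_runs = {"run_A": 0.955, "run_B": 0.80}

flagged = flag_outliers(history, new_runs)
print(flagged)  # only run_B is far enough from the baseline to be flagged
```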

Use Dash and Dashboard Engine for Quality Control

Whether you have a pipeline that outputs QC reports or you create QC reports manually for each run, Dash can streamline your findings by presenting key summary statistics about the overall quality of the raw sequencing reads from a given sample. With Dash, you can observe performance trends more easily and cross-compare your outputs within the same graphs. With Dash Enterprise Report Engine, you can create, annotate, archive, and share point-in-time views of your Dash apps. Report Engine adds powerful capabilities to any QC report. It allows you to:

Share a link to point-in-time Dash app views
Trigger emails and PDF reports programmatically
Enable on-demand snapshots through the Dash app UI
Automate nightly snapshots to archive Dash app state
Draw or comment directly on the Dash app canvas, then share by email

Suppose you or your scientists want to take a deeper dive into your dataset. You could turn to Dashboard Engine. This feature turns your dataset into a fully featured Dash app by providing a powerful drag-and-drop UI that is pre-built from the dataset. No callbacks or layout code are required.

In the above example, I parsed 3 FASTQ files into a pandas DataFrame and used Dashboard Engine to create QC views of the raw data.

Visualizing Alignment with AlignmentChart

Determining the similarity between two sequences is a common task in computational biology. Dash’s Alignment Chart helps researchers discover novel differences that appear across multiple sequences. The Alignment Chart component aligns multiple genomic or proteomic sequences from a FASTA or Clustal file. It displays three subplots showing gap and conservation info, alongside industry-standard color scale support for the consensus sequence. Thanks to the WebGL architecture powering the component, Alignment Chart can quickly display genes or proteins no matter the size of your alignment. In this example, we are going to compare Basiliscus organisms.
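To give a feel for what gap, conservation, and consensus information mean, here is a small standalone sketch that computes them column by column for an already-aligned FASTA. It is not Alignment Chart's own implementation: conservation is defined here simply as the fraction of sequences that share the most common residue in a column, and the component may use a different measure. The sequences are made up.

```python
from collections import Counter


def parse_fasta(text):
    """Parse FASTA text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def column_stats(aligned):
    """Return (consensus, conservation, gap_fraction) for each column."""
    sequences = list(aligned.values())
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("aligned sequences must all have the same length")
    stats = []
    for column in zip(*sequences):
        residues = [c for c in column if c != "-"]
        gap_fraction = 1 - len(residues) / len(column)
        if residues:
            consensus, count = Counter(residues).most_common(1)[0]
            conservation = count / len(column)
        else:
            consensus, conservation = "-", 0.0
        stats.append((consensus, conservation, gap_fraction))
    return stats


# Made-up aligned sequences; '-' marks a gap.
EXAMPLE_ALIGNMENT = """>seq1
ATG-C
>seq2
ATGAC
>seq3
A-GAT
"""

stats = column_stats(parse_fasta(EXAMPLE_ALIGNMENT))
consensus_sequence = "".join(residue for residue, _, _ in stats)
print(consensus_sequence)
for position, (residue, conservation, gaps) in enumerate(stats, start=1):
    print(position, residue, f"{conservation:.2f}", f"{gaps:.2f}")
```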

Retrieve Sequence Information from Public Databases:

Using Alignment Chart in Jupyter Notebook/Colab:

Here is a link to a Google Colab where you can tinker more with the open source code!

Alignment Chart allows sequence alignments to be viewed quickly and directly. It works across platforms and in the browser, and it can be paired with other Dash Bio components such as dash_bio.Clustergram to calculate phylogenetic trees or dash_bio.SequenceViewer to view specific regions of the sequence and convert it to its protein/amino acid composition.

Using DashBio components for your own analysis

Dash Bio components build on each other, so you can quickly and easily create your own dashboard for deeper insights. Dashboard Engine is particularly helpful if you want to give more agency to the scientists who run the sequencing, so they can build the custom graphs they need outside of your pipeline for deeper analysis. Combine Dash Bio with Dash Enterprise 5.0, and you have a faster, more accurate way to gather detailed insight into sample quality.

If you like where we’re going with Dash, head over to our freshly minted GitHub project and give it a star! 🌟 https://github.com/plotly/dash-bio. And check out some more Dash Bio apps in the Dash Gallery!

💊 If you’re a lab, chemical company, or drug development company, and you would like a customized Dash app or component built for you, please get in touch — we love a challenge. We also love giving Dash trainings if you’re re-thinking how analytics is done at your organization. Dash is an easy first Python library to learn, and we can help your team quickly get to Python-based productivity.

For more information, email us at info@plotly.com!