DEV Community

毅硕科技


Sentieon | Application Tutorial: Accelerating Custom Algorithms Using the Sentieon Python API Engine

Background

All modules in the Sentieon suite run several times to dozens of times faster than the corresponding open-source software. Users of these modules sometimes ask whether the Sentieon team can help accelerate their own custom-developed software as well. To bring the speed of Sentieon modules to such software, we have developed a Python API that supports secondary development and lets users accelerate their own tools.

API Introduction

*The Sentieon Python API is essentially a communication system that connects users' data analysis scripts with Sentieon's high-speed engine, accelerating the process while also improving the readability and maintainability of the scripts.*

Sentieon's data processing engine is the core of multiple Sentieon modules and analyzes BAM/CRAM and FASTA files at high speed. The engine supports two data flow modes: single-pass and multithreaded. The multithreaded mode is faster but more complex. It divides the genome into fragments with a default length of 1 Gb, and the engine processes these fragments independently and in parallel across multiple threads. Each fragment is further divided into smaller segments (steps) with a default length of 1 Kb, which the engine processes sequentially. The user's data processing logic is executed at high speed as part of this process.
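The fragment-and-step layout described above can be illustrated with a minimal sketch in plain Python. It is not the Sentieon API: every name here (`make_fragments`, `iter_steps`, `process_fragment`, `run`, `step_width`) is hypothetical, the default lengths are read as 10^9 and 10^3 bases as an assumption, and standard-library threads only mirror the structure of the data flow, not the engine's performance.

```python
from concurrent.futures import ThreadPoolExecutor

# Defaults quoted in the text, interpreted as base counts (an assumption).
FRAGMENT_LEN = 1_000_000_000  # "1 Gb" fragment
STEP_LEN = 1_000              # "1 Kb" step


def make_fragments(contig_lengths, fragment_len=FRAGMENT_LEN):
    """Split each contig into half-open (contig, start, end) fragments."""
    fragments = []
    for contig, length in contig_lengths.items():
        for start in range(0, length, fragment_len):
            fragments.append((contig, start, min(start + fragment_len, length)))
    return fragments


def iter_steps(fragment, step_len=STEP_LEN):
    """Yield the steps of one fragment in genomic order."""
    contig, start, end = fragment
    for step_start in range(start, end, step_len):
        yield contig, step_start, min(step_start + step_len, end)


def process_fragment(fragment, step_fn, step_len=STEP_LEN):
    """Apply the user's logic to each step of a fragment, one step at a time."""
    return [step_fn(step) for step in iter_steps(fragment, step_len)]


def run(contig_lengths, step_fn, threads=4,
        fragment_len=FRAGMENT_LEN, step_len=STEP_LEN):
    """Process fragments in parallel; results come back in fragment order."""
    fragments = make_fragments(contig_lengths, fragment_len)
    with ThreadPoolExecutor(max_workers=threads) as pool:
        per_fragment = pool.map(
            lambda fragment: process_fragment(fragment, step_fn, step_len),
            fragments,
        )
        return [result for results in per_fragment for result in results]


def step_width(step):
    """Toy user logic: report how many bases a step covers."""
    _contig, start, end = step
    return end - start


if __name__ == "__main__":
    # Small toy genome and toy lengths so the example finishes instantly.
    toy_genome = {"chrA": 2500, "chrB": 900}
    widths = run(toy_genome, step_width, threads=2,
                 fragment_len=1000, step_len=250)
    print(len(widths), sum(widths))
```

The user-supplied `step_fn` only ever sees one step at a time, which is what lets the surrounding loop decide how work is split across threads.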


Implementation Case Study

We demonstrate the acceleration achieved with Sentieon through a collaboration with the CREST software team at St. Jude Children's Research Hospital in the United States. CREST (Clipping REveals STructure) is a well-known tool for detecting structural variations in cancer genomes, using breakpoints as clues to identify them. The CREST workflow consists of soft-clip detection, assembly, post-assembly alignment, breakpoint confirmation, and structural variation confirmation. The assembly and alignment steps rely mainly on third-party tools. CREST's strength is its high accuracy, but its speed limitations are equally apparent: for a standard 30x tumor whole-genome paired sample, processing on a 20-thread workstation can take up to 24 hours, which is often too slow for users' needs.
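To make the soft-clip detection step concrete, here is a minimal sketch of how candidate breakpoints can be read off an alignment's CIGAR string. It is an illustration under stated assumptions, not CREST's implementation: the function name `soft_clip_breakpoints` and the `min_clip` threshold are hypothetical, and positions follow the SAM convention of a 1-based leftmost aligned base.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def soft_clip_breakpoints(pos, cigar, min_clip=5):
    """Return candidate breakpoints as (reference position, side) tuples.

    pos is the 1-based leftmost aligned position of the read.
    min_clip is a hypothetical threshold that ignores very short clips.
    """
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Hard clips may sit outside soft clips; they carry no bases, so drop them.
    ops = [(n, op) for n, op in ops if op != "H"]
    if not ops:
        return []
    breakpoints = []
    if ops[0][1] == "S" and ops[0][0] >= min_clip:
        # Clip on the left: the breakpoint is at the first aligned base.
        breakpoints.append((pos, "left"))
    if len(ops) > 1 and ops[-1][1] == "S" and ops[-1][0] >= min_clip:
        # Clip on the right: the breakpoint is at the last aligned base.
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        breakpoints.append((pos + ref_len - 1, "right"))
    return breakpoints


if __name__ == "__main__":
    # Toy alignments as (pos, cigar); real input would come from a BAM/CRAM file.
    alignments = [(100, "20S80M"), (100, "25S75M"), (100, "60M5D20M20S"), (100, "100M")]
    support = Counter()
    for pos, cigar in alignments:
        support.update(soft_clip_breakpoints(pos, cigar))
    print(support.most_common())
```

Counting how many reads support the same position and side gives a simple ranking of candidate breakpoints, which later steps such as assembly and alignment would then need to confirm.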


After learning about the capabilities of the Sentieon Python API, the CREST team reimplemented CREST's functionality on top of it. On test data, the Sentieon-accelerated version of CREST ran 10 times faster and produced results identical to those of the original CREST. On a 20-thread workstation, it reduced the processing time for the vast majority of samples to under 1 hour.


Next, we introduce two more acceleration case studies. Quality control is a crucial step in NGS data processing workflows. Although its logic is relatively simple, it involves extensive reading of BAM/CRAM files, and quality control tools often struggle to balance speed, multithreaded parallelism, and code maintainability.

The Sentieon Python API separates the algorithmic logic of a quality control tool from data reading, which improves both speed and code readability. As implementation examples, we used the Python API to accelerate Picard's CollectInsertSizeMetrics tool for fast insert size statistics and GATK's CalculateTargetCoverage tool for fast depth statistics in target regions. Users can refer to these cases when accelerating their own custom quality control tools.
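The idea of keeping the statistics logic separate from data reading can be sketched in plain Python. This is a simplified illustration, not Picard's algorithm and not the Sentieon API: the `Read` record and the `InsertSizeCollector` class are hypothetical, and the collector only computes a count, mean, and lower median from a histogram. Because each collector is independent and can be merged, one collector per genome fragment could be filled in parallel and combined at the end.

```python
from collections import Counter, namedtuple

# Hypothetical minimal read record; a real reader would supply these fields.
Read = namedtuple("Read", "template_length is_proper_pair is_first_in_pair")


class InsertSizeCollector:
    """Accumulates an insert size histogram, independent of how reads arrive."""

    def __init__(self):
        self.histogram = Counter()

    def add(self, read):
        # Count each properly paired fragment once, through its first read.
        if read.is_proper_pair and read.is_first_in_pair and read.template_length != 0:
            self.histogram[abs(read.template_length)] += 1

    def merge(self, other):
        # Counter.update adds counts, so partial histograms combine cleanly.
        self.histogram.update(other.histogram)

    def count(self):
        return sum(self.histogram.values())

    def mean(self):
        n = self.count()
        if n == 0:
            return None
        return sum(size * c for size, c in self.histogram.items()) / n

    def median(self):
        n = self.count()
        if n == 0:
            return None
        midpoint = (n + 1) // 2  # rank of the lower median
        seen = 0
        for size in sorted(self.histogram):
            seen += self.histogram[size]
            if seen >= midpoint:
                return size


if __name__ == "__main__":
    # Two collectors stand in for two fragments processed separately.
    part_a, part_b = InsertSizeCollector(), InsertSizeCollector()
    for read in [Read(300, True, True), Read(-300, True, False), Read(-310, True, True)]:
        part_a.add(read)
    for read in [Read(290, True, True), Read(5000, False, True), Read(0, True, True)]:
        part_b.add(read)
    part_a.merge(part_b)
    print(part_a.count(), part_a.mean(), part_a.median())
```

The same pattern, a small collector with `add` and `merge`, would also fit a per-target depth counter for coverage statistics.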


Technical Support

The Sentieon Python API lets users' scripts communicate with the Sentieon engine, enabling high-speed parallel reading of BAM/CRAM/FASTA files and speed improvements of more than 10-fold. **Users can build on this platform to accelerate their own custom software.** We are glad to provide comprehensive technical support.

Introduction to Sentieon Software

Sentieon offers a comprehensive, purely software-based solution for secondary analysis in genetic variant detection. Its analysis pipeline remains faithful to the mathematical models of gold-standard tools such as BWA, GATK, MuTect2, STAR, Minimap2, Fgbio, and Picard. While matching the results of the open-source workflows, Sentieon significantly improves analysis efficiency and detection accuracy for sequencing data from WGS, WES, panel, UMI, ctDNA, RNA, and other sources. It is compatible with all current second- and third-generation sequencing platforms.


The Sentieon software team has extensive experience in software development and algorithm optimization. The team is dedicated to removing speed and accuracy bottlenecks in biological data analysis, and provides efficient, precise software solutions to partners in molecular diagnostics, drug development, clinical medicine, population cohorts, and animal and plant research, jointly advancing genetic technology.

As of the end of 2023, Sentieon had served over 1300 users worldwide and had been cited nearly a thousand times, including in high-impact journals such as NEJM, Cell, and Nature. Sentieon has also won awards in authoritative evaluations such as the precisionFDA and DREAM Challenges for several consecutive years, gaining widespread recognition in the industry.
