Organization, processing, analysis and clinical interpretation of next-generation sequencing (NGS) data.
The optimization of sequencing technologies for high throughput has made it possible to gain insights into regulatory networks in a relatively short period of time, from the study of individual genes up to the entire human genome. Thus, an important application area of bioinformatics is the development of specific software for the analysis of high-throughput sequencing data and the extraction of biologically or clinically relevant information from it.
The raw data of each sequencing run must pass through multiple preprocessing steps before the results are diagnostically evaluated by a molecular biologist.
During library preparation, unique barcodes are added to each sample. After sequencing, this information is used to unequivocally assign the sequenced fragments (= reads) to the individual patients, automatically generating patient-specific FASTQ files. Converting the raw sequencing data of a multiplexed run into patient-specific FASTQ files is called ‘demultiplexing’. Each patient and, hence, each FASTQ file is assigned a random ID to ensure the anonymity of the data.
The FASTQ files serve as input for the alignment of the reads to the human reference genome, i.e. the sequenced DNA fragments are assigned to their matching region in the human genome based on their base sequence. The haploid human genome consists of around 3 billion bases, all of which are sequenced in the case of whole genome sequencing (WGS). In routine diagnostics, the analysis is usually focused on a subset of genes or gene regions that are associated with hematologic malignancies and are sequenced at considerable depth. To detect even mutations present only in small clones, the target coverage ranges from more than 400x up to 2000x, and several hundred million base pairs of sequence information are generated per patient. Determining the exact genomic position of the reads in relation to the reference sequence is computationally very intensive, but can be significantly accelerated by parallelization. The results of the alignment are usually stored as a binary alignment/map (BAM) file.
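The demultiplexing step described above can be illustrated with a minimal sketch. It is not the pipeline used in practice: it assumes that the barcode sits at the start of each read and must match exactly, whereas production tools typically read barcodes from separate index reads and tolerate mismatches. The function names, the barcode sequences and the sample IDs are hypothetical.

```python
from collections import defaultdict
from io import StringIO


def read_fastq(handle):
    """Yield (header, sequence, quality) from a FASTQ stream with 4 lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def demultiplex(handle, barcode_to_sample, barcode_length):
    """Group reads by sample ID using an inline barcode prefix (assumption)."""
    per_sample = defaultdict(list)
    for header, sequence, quality in read_fastq(handle):
        barcode = sequence[:barcode_length]
        # Reads whose barcode is unknown are collected separately.
        sample = barcode_to_sample.get(barcode, "undetermined")
        # Remove the barcode from both the sequence and its quality string.
        trimmed = (header, sequence[barcode_length:], quality[barcode_length:])
        per_sample[sample].append(trimmed)
    return dict(per_sample)


# Toy multiplexed run with two samples and one unassignable read.
EXAMPLE_FASTQ = (
    "@read1\nACGTTTGACCA\n+\nIIIIIIIIIII\n"
    "@read2\nTGCAGGCATTA\n+\nIIIIIIIIIII\n"
    "@read3\nACGTCCATGGA\n+\nIIIIIIIIIII\n"
    "@read4\nGGGGAAAATTT\n+\nIIIIIIIIIII\n"
)
# Hypothetical barcodes mapped to hypothetical anonymized sample IDs.
BARCODES = {"ACGT": "ID_7f3a", "TGCA": "ID_c21e"}

reads_by_sample = demultiplex(StringIO(EXAMPLE_FASTQ), BARCODES, barcode_length=4)
for sample, reads in sorted(reads_by_sample.items()):
    print(sample, len(reads))
```

Each list in `reads_by_sample` corresponds to what would be written to one patient-specific FASTQ file; using the random ID as the key keeps the output free of patient-identifying information.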
Variant calling and variant annotation
Special algorithms are then used to scan the BAM files and identify deviations from the human reference genome. Single-base exchanges (single nucleotide variants, SNVs) as well as small insertions and deletions can be detected. Subsequently, the variants are annotated, which provides additional information about each detected variant. This includes the identification of the gene that overlaps with the variant, a precise characterization of the genomic region (exon, intron, intron-exon transition) in which the variant was found, a translation of the variant into a standardized nomenclature, an estimation of the possible functional effect of the variant (missense, synonymous, polymorphism, etc.), and, if available, other relevant facts. To assess whether the discovered sequence variants represent clinically relevant mutations or benign polymorphisms, the MLL compares them with clinical databases as well as with the in-house database. The generated data is subsequently transferred to the database system, enabling rapid comparison of results from different analytical methods and an early diagnosis for each patient.
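The idea of scanning aligned reads for deviations from the reference can be sketched as a simple pileup. This is an illustration under strong assumptions, not the algorithm used in practice: reads are given as gapless alignments with a 0-based start position, only SNVs are considered (no insertions or deletions), and the thresholds `min_depth` and `min_vaf` are arbitrary example values. Real variant callers additionally model base and mapping qualities, strand bias and sequencing error. All names are hypothetical.

```python
from collections import Counter


def call_snvs(reference, aligned_reads, min_depth=4, min_vaf=0.05):
    """Report SNVs from gapless alignments given as (0-based start, sequence)."""
    # One base counter per reference position (the "pileup").
    pileup = [Counter() for _ in reference]
    for start, sequence in aligned_reads:
        for offset, base in enumerate(sequence):
            position = start + offset
            if 0 <= position < len(reference):
                pileup[position][base] += 1

    variants = []
    for position, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to make a call at this position
        for base, count in sorted(counts.items()):
            if base == reference[position]:
                continue
            vaf = count / depth  # variant allele frequency
            if vaf >= min_vaf:
                variants.append({
                    "pos": position + 1,  # reported 1-based
                    "ref": reference[position],
                    "alt": base,
                    "depth": depth,
                    "vaf": vaf,
                })
    return variants


# Toy reference and six reads; two reads carry an A>T exchange at position 5.
REFERENCE = "ACGTACGTAC"
READS = [
    (0, "ACGTACGT"),
    (0, "ACGTTCGT"),
    (2, "GTACGTAC"),
    (2, "GTTCGTAC"),
    (1, "CGTACGTA"),
    (3, "TACGTAC"),
]

variants = call_snvs(REFERENCE, READS)
for variant in variants:
    print(variant)
```

The variant allele frequency computed here shows why high coverage matters for small clones: a variant carried by only a few percent of reads becomes distinguishable from sequencing noise only when many reads cover the position.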