Each human cell contains the complete diploid chromosome set in its cell nucleus (46, XY or 46, XX), which contains the entire genetic information of each person. The DNA that carries this information consists of 3 billion base pairs, which code for approx. 23,000 genes. Because each human cell contains the same DNA regardless of its function, DNA represents the most fundamental building block of the cell. Whole genome sequencing (WGS) aims to read a person’s complete genetic information, detect polymorphisms, and identify somatic mutations, which play an important role in cancer diagnostics.
The DNA of two persons is highly similar (99.9% identical), but differs by numerous polymorphisms (approx. 4 million different bases per person). Currently, approx. 10 million polymorphisms are known, which corresponds to approx. 1 change per 1,300 base pairs. Each person receives these germline variants from the maternal and paternal chromosomes at the fertilization of the ovum. In addition to clinically non-relevant polymorphisms, mutations associated with diseases can also be passed on, thereby leading to a congenital gene defect and hence a hereditary disease. Over the course of a lifetime, a person also accumulates changes (somatic mutations), which under certain circumstances can lead to an illness such as cancer. Sequencing of stem cells showed that a maximum of 1 mutation per stem cell could be detected in newborns, while the number of mutations rose to 8 to 12 in 70- to 80-year-olds (Welch et al., Cell, 2012). When these mutations occur in genes important for hematopoiesis, clonal expansion of the affected stem cell may occur, thereby increasing the risk of hematological neoplasia (Jaiswal et al., NEJM, 2014).
Besides polymorphisms and somatic mutations, WGS can also be used to detect gains and losses of chromosomal material (copy number variations, CNV) and translocations of chromosomal material (structural variations, SV). In addition to the search for disease-associated mutations and changes, attempts are increasingly being made to obtain predictive information from genome-wide data, e.g. on the response to individual therapies (genome-wide association studies, GWAS). The more information is available about the tumor disease and the genetic background of the patient, the more efficient a targeted therapy can be in the future.
Preparing the DNA – Library preparation
There are two fundamentally different approaches to library preparation for WGS: PCR-free and amplification-based. The PCR-free method requires a relatively large amount of input DNA (1 µg), but avoids PCR artifacts. Generally, sufficient DNA for a PCR-free library prep can be obtained from bone marrow and peripheral blood. If the raw material exists in the form of fixed tissue (formalin-fixed, paraffin-embedded; FFPE) or as cell-free DNA from liquid biopsy samples, a pre-amplification method must be chosen in order to obtain sufficient material for the sequencing. Library prep includes the fragmentation of the DNA, end repair, and the ligation of adapters, which contain unique indexes so that each individual read can be uniquely assigned to a patient after sequencing. At MLL, library prep is performed in a fully automated procedure by pipetting robots (Hamilton NGS Star). This ensures standardized and homogeneous library prep.
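The role of the unique indexes can be illustrated with a minimal sketch. It assumes that each read has already been paired with the index sequence read from its adapter; the index sequences, sample names, and the function `demultiplex` are hypothetical and do not represent the software actually used at MLL.

```python
from collections import defaultdict

# Hypothetical index-to-sample sheet; real index sequences depend on the kit used.
SAMPLE_SHEET = {
    "ACGTACGT": "patient_A",
    "TGCATGCA": "patient_B",
}


def demultiplex(reads, sample_sheet):
    """Group (index, sequence) pairs by sample.

    Reads whose index is not in the sample sheet are collected
    under the key 'undetermined'.
    """
    by_sample = defaultdict(list)
    for index, sequence in reads:
        sample = sample_sheet.get(index, "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)


# Toy reads from one pooled sequencing run (made-up sequences).
reads = [
    ("ACGTACGT", "GATTACA"),
    ("TGCATGCA", "CCGGTTA"),
    ("ACGTACGT", "TTAGGCA"),
    ("NNNNNNNN", "AAAAAAA"),
]
assigned = demultiplex(reads, SAMPLE_SHEET)
```

Because the assignment relies only on the index, libraries from several patients can be pooled and sequenced together without mixing up their reads.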
At MLL, sequencing is performed using the Illumina sequencing by synthesis method on the latest generation of sequencing devices, the NovaSeq 6000. While a coverage (depth) of 30× is often sufficient in human genetics, tumor biology depends on detecting somatic mutations, including those carried only by small clones. Therefore, sequencing is usually performed with a coverage of > 60–90×.
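Why small clones call for a higher coverage can be made concrete with a short calculation. The sketch below assumes a simple binomial model in which each read covering a position carries the variant with a probability equal to the variant allele fraction. The read length of 150 bases, the read count, the 5% allele fraction, and the requirement of at least 3 supporting reads are illustrative assumptions, not MLL parameters.

```python
from math import comb

GENOME_SIZE = 3_000_000_000  # approx. 3 billion base pairs, as stated in the text


def mean_coverage(n_reads, read_length, genome_size=GENOME_SIZE):
    """Average number of reads covering each base of the genome."""
    return n_reads * read_length / genome_size


def prob_at_least(k, depth, allele_fraction):
    """Binomial probability of seeing at least k variant reads at a given depth."""
    p_fewer = sum(
        comb(depth, i) * allele_fraction**i * (1 - allele_fraction) ** (depth - i)
        for i in range(k)
    )
    return 1 - p_fewer


# Assumed example: 600 million reads of 150 bases each.
cov = mean_coverage(600_000_000, 150)

# Chance of observing >= 3 variant reads for a clone at 5% allele fraction.
p30 = prob_at_least(3, depth=30, allele_fraction=0.05)
p90 = prob_at_least(3, depth=90, allele_fraction=0.05)
```

Under these assumptions `p90` is considerably larger than `p30`, which is the quantitative reason for sequencing tumor samples more deeply than germline samples.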
Subsequently, data analysis is conducted. At MLL, the data from the sequencing devices is transferred directly to the Amazon Web Services (AWS) cloud in Frankfurt and analyzed in Illumina’s BaseSpace Sequence Hub. Data protection requirements are complied with in accordance with the EU General Data Protection Regulation (GDPR) and ensured via ISO 27001 certification (Cloud Computing). Firstly, the alignment of the reads (iSAAC, Illumina) to the reference genome takes place, i.e. the mapping of the fragments to their position in the genome. Subsequently, variant calling is performed (Strelka, Illumina), i.e. determining the changes for a patient as compared to a reference sequence (GRCh37, hg19). Usually, a “tumor-normal comparison” is performed here: By sequencing the tumor and e.g. peripheral blood as a normal control, the genome of a person can be compared for both materials, thereby allowing the differences in the tumor to be identified. In hematology, this is where we are faced with a huge challenge, as the frequently-used peripheral blood from patients with hematological neoplasia already contains the “tumor,” namely the leukemia cells, which means that it is not an option for an easily available normal control. Hence, we use a “tumor-unmatched normal” workflow in order to eliminate artifacts and a percentage of the polymorphisms. This involves utilizing sequences of healthy controls from other persons. For the further removal of irrelevant changes and the detection of CNVs (GATK, Broad Institute) and SVs (Manta, Illumina), in-house analysis pipelines are available at MLL.
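The idea behind the "tumor-unmatched normal" workflow can be sketched as a simple filter: variants that also occur in healthy controls from other persons are treated as artifacts or common polymorphisms and removed. This is a minimal illustration under assumptions. The variant representation as `(chromosome, position, ref, alt)` tuples, the function name, the `min_normals` parameter, and all coordinates are hypothetical and not taken from the MLL pipeline.

```python
def filter_against_unmatched_normals(tumor_variants, normal_panels, min_normals=1):
    """Keep tumor variants seen in fewer than `min_normals` healthy controls.

    tumor_variants: list of (chromosome, position, ref, alt) tuples
    normal_panels:  list of sets, one set of variants per healthy control
    """
    kept = []
    for variant in tumor_variants:
        n_seen = sum(1 for panel in normal_panels if variant in panel)
        if n_seen < min_normals:
            kept.append(variant)
    return kept


# Made-up example data.
tumor = [
    ("chr1", 1000, "A", "G"),  # also present in a healthy control
    ("chr2", 2000, "C", "T"),  # seen only in the tumor sample
    ("chr3", 3000, "G", "A"),  # present in both healthy controls
]
normals = [
    {("chr1", 1000, "A", "G"), ("chr3", 3000, "G", "A")},
    {("chr3", 3000, "G", "A")},
]
candidates = filter_against_unmatched_normals(tumor, normals)
```

Rare germline polymorphisms that occur only in the patient are not present in any unrelated control and therefore survive such a filter, which is why this approach removes only a percentage of the polymorphisms.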
Welch et al., Cell. 2012 Jul 20;150(2):264–78.
Jaiswal et al., N Engl J Med 2014; 371:2488–2498.
Unlike whole genome sequencing (WGS), whole exome sequencing (WES) focuses on the protein-coding region of the genome, which is called the exome. A person’s exome accounts for only approx. 1% of the genome, which is why only approx. 30 million base pairs are read during WES. However, the majority of disease-associated mutations and changes can be found in the exome, as the sequence changes occurring here have a direct effect on the structure and hence functionality of proteins, and can therefore modify the function of the cell.
Hence, although WES also allows gene mutations to be detected, it only provides an incomplete view of a patient’s genome. This means that procedures such as GWAS (genome-wide association studies), which also detect changes in non-coding regions, can only be performed to a limited extent. Chromosomal changes (structural variations, SV; copy number variations, CNV) can only be detected if they affect coding regions.
Preparing the DNA – Library preparation
In addition to the fragmentation of the DNA, end repair, and the ligation of adapters, which contain unique indexes so that each individual read can be uniquely assigned to a patient after sequencing, library preparation for WES also involves the enrichment of the coding sequences. Using probes whose sequence is complementary to the coding regions of the genome, the exome sequences can be specifically selected (capturing) and enriched. The xGen Exome Research Panel (IDT, Integrated DNA Technologies) uses 429,826 probes to enrich 39 Mb of genomic sequences (19,396 genes) and prepare them for sequencing. At MLL, library prep is performed in a fully automated procedure by pipetting robots (Hamilton NGS Star). This ensures standardized and homogeneous library prep.
At MLL, sequencing is performed using the Illumina sequencing by synthesis method on the latest generation of sequencing devices, the NovaSeq 6000. Generally, a coverage (depth) of > 100× is targeted for WES, as the detection of somatic mutations, including those carried only by small clones, is of great importance in tumor biology.
Subsequently, data analysis is conducted. At MLL, the data from the sequencing devices is transferred directly to the Amazon Web Services (AWS) cloud in Frankfurt and analyzed in Illumina’s BaseSpace Sequence Hub. Data protection requirements are complied with in accordance with the EU General Data Protection Regulation (GDPR) and ensured via ISO 27001 certification (Cloud Computing). Firstly, the alignment of the reads (iSAAC, Illumina) to the reference genome (GRCh37, hg19) takes place, i.e. the mapping of the fragments to their position in the genome. Subsequently, variant calling is performed, i.e. determining the changes for a patient as compared to a reference sequence. For the further filtering of relevant changes, in-house analysis pipelines are available at MLL.
Each cell in the human body has an identical copy of the genome (DNA) – the full set of genetic material. However, the transcriptomes of the cells differ. RNA sequencing (RNA-Seq) analyzes the transcriptome; i.e. it is a quantitative determination of the transcribed (from DNA to RNA) genes present in the cell. The expression of the transcriptome provides the basis for the identity of a cell and the associated functionality.
Differentiated cells possess a specific repertoire of biological functions. For example, white blood cells play an important role in the immune system, red blood cells in the transportation of oxygen to the individual organs, and blood platelets in clotting. A particular set of genes is necessary for each of these functions, as well as for regulating the lifetime of a cell. Gene expression is controlled strictly via various mechanisms. In the case of an illness such as cancer, abnormal gene regulation occurs, which significantly modifies the transcriptome of the affected cells and influences the proportion of the genes transcribed. These changes can be detected and quantified using RNA-Seq, for example by comparing the transcriptome of the tumor cells with the profile of healthy cells.
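A common way to quantify such a comparison is the log2 fold change between tumor and healthy expression values. The sketch below is illustrative only: the gene names and expression values are made up, and the pseudocount of 1 is an assumption used to avoid division by zero for genes that are not expressed.

```python
from math import log2


def log2_fold_change(tumor_expr, healthy_expr, pseudocount=1.0):
    """Per-gene log2 ratio of tumor to healthy expression.

    Only genes present in both profiles are compared. Positive values
    mean higher expression in the tumor, negative values lower expression.
    """
    shared_genes = tumor_expr.keys() & healthy_expr.keys()
    return {
        gene: log2(
            (tumor_expr[gene] + pseudocount) / (healthy_expr[gene] + pseudocount)
        )
        for gene in shared_genes
    }


# Hypothetical normalized expression values.
tumor = {"GENE_A": 799.0, "GENE_B": 24.0, "GENE_C": 99.0}
healthy = {"GENE_A": 99.0, "GENE_B": 99.0, "GENE_C": 99.0}
lfc = log2_fold_change(tumor, healthy)
```

In this toy example GENE_A is eightfold higher in the tumor (log2 fold change of 3), GENE_B is fourfold lower (log2 fold change of -2), and GENE_C is unchanged.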
In addition to changes in gene expression, RNA-Seq also allows fusion genes to be detected, which are the result of structural changes (translocations of chromosomal material). A person’s transcriptome contains not only protein-coding transcripts, but also transcripts that do not lead to the formation of a protein. These transcripts can be subdivided into two groups based on their length: short RNAs (microRNA, snoRNA, snaRNA, etc.) with a length of 20–24 bases and the long non-protein-coding RNAs (long non-coding RNAs, lncRNAs) with a length of over 200 bases. These transcripts are involved in the regulation of gene expression, making them a good starting point for interventions and therapies.
Preparing the RNA – Library preparation
As with the analysis of DNA (WGS, WES), library preparation is conducted prior to the sequencing of the transcriptome. This process includes the fragmentation of the RNA, the removal of ribosomal RNA, the synthesis of cDNA from the RNA, the ligation of uniquely identifiable indices that make it possible to tell one sample apart from another, and a subsequent enrichment of the material via PCR. At MLL, library prep is performed in a fully automated procedure by pipette robots (Hamilton NGS Star). This ensures standardized and homogeneous library prep.
The library prepared in this fashion is then loaded onto the sequencing devices and read out using the sequencing by synthesis method. At MLL, the device used is the NovaSeq 6000, the latest generation of sequencing devices from Illumina. In order to achieve sufficient accuracy during the transcriptome analysis, the target is 50 million reads (sequenced fragments) per sample.
Subsequently, data analysis is conducted. At MLL, the data from the sequencing devices is transferred directly to the Amazon Web Services (AWS) cloud in Frankfurt and analyzed in Illumina’s BaseSpace Sequence Hub. Data protection requirements are complied with in accordance with the EU General Data Protection Regulation (GDPR) and ensured via ISO 27001 certification (Cloud Computing). First, the reads are aligned (STAR, Illumina) to the reference genome (GRCh37, hg19), i.e. the fragments are mapped to their position in the genome. Next, the counts are determined, i.e. the number of reads per gene, which are then normalized in an internal MLL pipeline. The normalized counts constitute the starting point for all further analyses. For detecting fusion genes, a variant caller (Manta, Illumina) is used, which identifies deviations in the base order of the reads as compared to the reference sequence.
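The text does not describe how the internal MLL pipeline normalizes the counts. As an illustration of why normalization is needed at all, the sketch below uses counts per million (CPM), one widely used and simple scheme that makes samples with different sequencing depths comparable. The gene names and counts are hypothetical, and CPM is an assumed stand-in, not the MLL method.

```python
def counts_per_million(counts):
    """Scale raw read counts per gene so that they sum to one million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no reads")
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


# Two made-up samples with identical composition but different depth.
shallow = {"GENE_A": 500, "GENE_B": 1500, "GENE_C": 8000}
deep = {"GENE_A": 5000, "GENE_B": 15000, "GENE_C": 80000}

cpm_shallow = counts_per_million(shallow)
cpm_deep = counts_per_million(deep)
```

After scaling, both samples yield the same values per gene, so differences in the normalized counts reflect the composition of the transcriptome rather than the number of reads sequenced.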
Apart from the DNA in cells, cell-free DNA can also be obtained from bodily fluids. In most cases, this refers to freely circulating DNA from the blood (cfDNA, cell-free DNA). It is assumed that this DNA is released from apoptotic cells. Tumors are characterized by high rates of proliferation and apoptosis. During the process of apoptosis, a cell goes through programmed cell death, which results in the cell breaking apart and DNA being released into the surrounding tissue. This type of diagnostics is called a “liquid biopsy.” It is a non-invasive method that is preferred for monitoring the progress of previously diagnosed instances of cancer and for assessing the response to a therapy without having to perform time-consuming tissue biopsies in the case of solid tumors.
Because the concentration of this cfDNA is extremely low, it must first be replicated using special amplification methods before it can be examined for changes (mutations). This allows tests to be performed on whether cells from the residual tumor are still present in the body, which would increase the risk of recurrence. In addition, intensive work is being done on the development of tests that aim to enable early cancer detection using cell-free DNA from blood. Generally, in addition to cfDNA, a second type of DNA can be obtained using a liquid biopsy: cell-bound DNA from freely circulating tumor cells (CTCs). These indicate a possible metastasis of the primary tumor. The value of the liquid biopsy with the detection of cfDNA in patients with lymphomas is currently being evaluated.
Preparing the DNA – Extraction and library preparation
For extracting cfDNA, 10 ml of blood is drawn from the patient in special blood vials. In these vials, the blood is anticoagulated and stabilized; it can be transported in them and stored for up to 7 days. The vials inhibit hemolysis and apoptosis of the blood cells, so that no cellular DNA from decaying blood cells enters the plasma. The cfDNA can then be selectively isolated from the plasma. Special extraction kits allow the cfDNA, which occurs in very low concentrations, to be isolated from large volumes (approx. 10 ml of plasma) and eluted in a small volume (20 µl), thereby concentrating it.
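The effect of eluting in a small volume follows from simple arithmetic. The sketch below uses the volumes given above and assumes, as a simplification, that all cfDNA is recovered during extraction; real recovery rates are lower and depend on the kit.

```python
def concentration_factor(input_volume_ml, elution_volume_ul):
    """Factor by which the DNA concentration rises, assuming complete recovery."""
    input_volume_ul = input_volume_ml * 1000  # 1 ml = 1000 µl
    return input_volume_ul / elution_volume_ul


# 10 ml of plasma eluted in 20 µl, as in the text.
factor = concentration_factor(10, 20)
```

With these volumes the cfDNA is concentrated by a factor of 500 at most.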
Subsequently, the cfDNA thus obtained can be analyzed using PCR or next-generation sequencing and examined for markers that characterize tumors. For the library preparation preceding the sequencing, it should be noted that cfDNA is frequently highly fragmented (~180 bp long) and present in extremely low concentrations, so library preps with a pre-amplification step, in which the quantity of DNA is first replicated, should be used.
Modern analysis procedures provide doctors with an ever-growing quantity of information, the analysis of which is becoming increasingly difficult without the aid of computers. This flood of data has led to a dramatic transformation taking place over the past two decades in the field of data analysis. While programming was originally designed to allow computers to be taught how to solve problems via well-defined rules, today’s computers are increasingly capable of independent learning and hence developing artificial intelligence.
Instead of specifying the rules for the computer, sample data is collected (e.g. images, texts, audio), from which the algorithm (= computer program) independently selects and extracts the relevant information and then creates its own rules. Such algorithms are frequently based on how the human brain works, which is why they are called “neural networks.”
Classic pattern recognition in the medical field
In the field of medicine, there are various areas in which artificial intelligence is being used or can be utilized. The use of neural networks is particularly advanced in the field of image recognition. According to WHO guidelines, leukemia diagnostics continues to be strongly defined by cytomorphology. Cytomorphology focuses on the assessment of blood and bone marrow smears for describing and differentiating between malignant and healthy cells. For this purpose, the morphologist looks for abnormal patterns with regard to the appearance and number of the various cell types, which are then classified according to predefined guidelines. This manual process is comparatively time-consuming and the quality of the results depends greatly on the morphologist’s experience. A standardized and automated procedure that assists the morphologist via pattern recognition is therefore desirable.
Pattern recognition is a multi-layered process that begins with segmentation. In this step, the aim is to recognize relevant shapes (e.g. a cell) and to distinguish them from image artifacts such as dirt. Furthermore, additional sub-structures also need to be identified within the shapes found, e.g. defining the nucleus of a cell. It is obvious that a defined set of rules according to which these distinctions are made will not be able to sufficiently cover all possible eventualities that may occur, and would very quickly become overly extensive and time-consuming. Due to this, it has proved to be more efficient to train the algorithm using sample images. Additionally, this method also has the advantage that the computer is able to make generalizations based on the information learned, namely that it will also be able to identify cells that do not exactly correspond to the sample images as such. What needs to be done here is to achieve a balance between generalization and specificity. This fine-tuning is an iterative process, the accuracy of which increases along with the amount of data – the more data is available, the more accurate the evaluation. Similarly, a morphologist’s accuracy also increases with experience – the more time he has spent in front of a microscope and the more extensive the range of smears examined, the more accurate and rapid his assessment.
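To make the segmentation step tangible, the sketch below implements exactly the kind of hand-written rule set that the paragraph describes as insufficient: pixels above a fixed brightness threshold are grouped into connected regions, and regions smaller than a minimum area are discarded as dirt. The threshold, the minimum area, and the toy image are assumptions chosen for illustration. A trained network replaces such fixed rules with criteria learned from sample images.

```python
def segment(image, threshold, min_area):
    """Find connected regions of bright pixels and drop small ones.

    image:     2D list of grayscale values
    threshold: pixels with a value >= threshold count as foreground
    min_area:  regions with fewer pixels are treated as artifacts
    Returns a list of regions, each a list of (row, column) tuples.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] < threshold:
                continue
            # Flood fill over the four direct neighbours of each pixel.
            stack, region = [(r, c)], []
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (
                        0 <= ny < rows
                        and 0 <= nx < cols
                        and not seen[ny][nx]
                        and image[ny][nx] >= threshold
                    ):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            if len(region) >= min_area:
                regions.append(region)
    return regions


# Toy image: one 2x2 "cell" and one isolated bright pixel ("dirt").
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 9, 9, 0, 0, 0],
    [0, 9, 9, 0, 0, 7],
    [0, 0, 0, 0, 0, 0],
]
cells = segment(image, threshold=5, min_area=3)
```

The fixed values work for this toy image but would fail as soon as staining intensity, magnification, or cell size vary, which illustrates why training on sample images scales better than extending the rule set.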
Similar procedures can be applied in all areas which rely primarily on the analysis of image files. For example, it is also conceivable that with the continued improvement in the algorithms for automatic image recognition, these techniques will be used in cytogenetics to enable automatic identification and categorization of chromosomal aberrations in recorded metaphases. The same applies for immunophenotyping, in which malignant cells are distinguished from healthy ones based on their antigen expression pattern through the use of flow cytometry. The individual cell types are characterized by the expression of specific antigen combinations. The diagnosis of the various hematological neoplasias is performed via the interpretation of the two-dimensional images recorded from flow cytometry. Each analysis involves measuring thousands of cells, which greatly increases the quantity of data. With the great amount of available image material, artificial neural networks can be trained and used for the automatic interpretation of the data obtained. To this end, MLL collaborates closely with various institutions to further research in this area.
Automatic classification in the medical field
In molecular genetics, the increase in the number of sequencing procedures carried out has resulted in ever-growing quantities of data, making manual interpretation of the data increasingly difficult. While investigations were previously limited to individual genes, high-throughput sequencing methods allow for simultaneous examination of the entire genome (WGS) and/or transcriptome (RNA-Seq). The goal of these methods is not only the high-throughput gene-specific analysis of changes and/or overexpression, but more importantly uncovering underlying regulatory mechanisms and identifying recurring genetic patterns. For example, are there certain combinations of genetic changes that characterize the clinical picture of a particular type of leukemia? Various molecular markers that distinguish the individual sub-types of leukemia from each other are already known, but this knowledge is currently still limited. The immense quantities of data make it impossible to manually sift through the genomic data, and because we do not know what we are looking for, it is also not possible to explain to a computer what it should be searching for. For this reason, machine learning methods that learn independently from the data and extract relevant information are used. This approach has two objectives: first, to achieve an automatic classification of unknown samples, and second, to obtain additional insight into the fundamental principles of the various illnesses. To this end, the algorithm, which in this case frequently consists of neural networks, is trained using the genomic data of the various sub-types and its performance is evaluated. This is a highly iterative process that aims to find the optimal setting for the parameters, which guarantees the best performance and hence the most accurate classification.
Even though the genomes of different persons are 99.9% identical, they still differ in terms of numerous polymorphisms. In order to prevent these individual differences from negatively influencing the performance of the classifier, large quantities of training data are necessary to cover the resulting diversity and guarantee accurate estimates. Because the algorithm independently searches for the characteristics of the individual sub-types, it stands to reason that this would also allow new correlations and associations to be found, which could help with obtaining a better understanding of the basic molecular principles. Together with known characteristics from routine diagnostics, the aim here is to enable improved diagnosis and prognosis assessments.
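The train-then-classify cycle described above can be sketched with a deliberately simple method. The example uses a nearest-centroid classifier, not a neural network, because it fits in a few lines of plain Python. The marker profiles, the sub-type labels, and the function names are hypothetical and serve only to show the principle of learning from labeled samples and then classifying an unknown one.

```python
def train_centroids(samples, labels):
    """Compute the mean feature vector (centroid) of each class."""
    sums, counts = {}, {}
    for features, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(centroids, features):
    """Assign the class whose centroid is closest (squared Euclidean distance)."""

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))

    return min(centroids, key=lambda label: squared_distance(centroids[label]))


# Made-up training data: each row marks presence (1) or absence (0) of four markers.
train_x = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
]
train_y = ["subtype_1", "subtype_1", "subtype_2", "subtype_2"]

model = train_centroids(train_x, train_y)
prediction = classify(model, [1, 1, 1, 0])  # profile not seen during training
```

In this toy data set the second and third markers vary between individuals without being linked to a sub-type, so their centroid values are identical for both classes and they do not influence the decision. With real data, many training samples are needed before such uninformative differences average out.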
Apart from allowing for the storage of large quantities of data, cloud computing also enables the data to be processed and analyzed rapidly, as the required calculations can be performed in a highly parallel fashion. Thanks to cloud computing, we have computing capacity available for research projects that allows for rapid processing of the data, something that could only be realized in-house at great cost.
Cloud computing has made it possible for us to process the WGS data from the 5,000 genome project directly in the cloud via Illumina’s BaseSpace Sequence Hub. Furthermore, we can use our private domain in the cloud to upload our proprietary software and use it to analyze the data without it needing to be transferred. This means that the data is available directly to us for all analyses and scientific queries.
Where there is data analysis, data management is also necessary, as WGS in particular produces a great amount of data (~130 GB per patient for 90× coverage). This requires a dedicated infrastructure that allows not only for the analysis, but also for the storage of the data. In the past, there was great skepticism regarding the cloud for data processing, and even more so as a data storage solution. However, the ever-growing amounts of data being produced today make it increasingly clear that not only the hardware, but also the maintenance of IT infrastructure comes at a high cost. Hence, it is usually easier and more economical for specialized cloud providers to keep both hardware and security up to date and to offer the highest security standards. MLL’s WGS, WES and RNA-Seq data are stored in completely anonymized form in a private AWS instance (AWS, Amazon Web Services) in Frankfurt, to which only selected employees at MLL have access. The data stored there is exclusively sequence data labeled with an arbitrary MLL_Identifier. No personal data whatsoever, such as clinical parameters or personal information, is stored in the AWS instance. The data security measures comply with the highest standards of the EU General Data Protection Regulation (GDPR), which has also been verified by external auditors in their reports, including ISO 27001, ISO 27017 and ISO 27018. Furthermore, AWS has also been awarded the C5 attestation of the Federal Office for Information Security (BSI).
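The scale of the storage requirement can be estimated from the figures in the text. The calculation below assumes, for illustration only, that every genome of the 5,000 genome project was sequenced at 90× and occupies about 130 GB; it uses decimal units (1 TB = 1,000 GB) and ignores intermediate files and backups.

```python
GB_PER_GENOME = 130  # approx. size per patient at 90x coverage (from the text)
N_GENOMES = 5_000    # size of the 5,000 genome project (from the text)

total_gb = GB_PER_GENOME * N_GENOMES
total_tb = total_gb / 1000  # decimal terabytes
```

Under these assumptions the sequence data alone amounts to roughly 650 TB, which illustrates why storage and maintenance are a relevant cost factor.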