Large-scale sequencing experience from the MLL 5K project

The project was started in 2017 with the aim of sequencing the genomes and transcriptomes of 5,000 patients with haematological malignancies. The goals are to gain deeper knowledge of their molecular profiles, to better understand the underlying complexity, to refine diagnosis, and to take another step towards personalized medicine.

The MLL Biobank covers more than 40 entities of haematological malignancies, including rare forms of leukemia and lymphoma. The biobank not only lets us choose from a broad range of leukemia and lymphoma subgroups, it also provides sufficient DNA (1 µg) for PCR-free library preparation (TruSeq PCR-free, Illumina). PCR-free library preparation avoids PCR artifacts that can affect downstream analyses, but such kits usually cannot be used when only small amounts of input material are available.

Sequencing statistics

Within one year, more than 3,500 genomes with 90x coverage (150 bp paired-end) and 3,500 transcriptomes (100 bp paired-end) with a median of 50 million reads were sequenced on the HiSeq X and NovaSeq 6000 systems (Illumina). At the beginning of the project, only the S2 flow cell was available for the NovaSeq system, which limited the maximum throughput of five NovaSeq instruments to ~90 genomes (90x) per week. With the release of the S4 flow cell in October 2017, the throughput could be increased to 200 genomes and 192 transcriptomes per week. The median percentage of clusters passing filter (%PF) for an S2 run was 68.44% (range 64.22% - 74.59%) with a median Q30 score of 89.24%. For an S4 run, the median %PF was 67.04% (range 63.22% - 74.05%) with a median Q30 score of 88.99%.
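To give a feeling for the scale behind "90x coverage (150 bp paired-end)", the following minimal sketch estimates how many read pairs one genome requires. It uses the simple relation coverage = sequenced bases / genome size. The genome size of 3.1 Gb and the function name are assumptions for illustration, and the estimate ignores duplicates, unmapped reads and filtered clusters.

```python
import math

# Assumed human genome size in base pairs (not stated in the text).
ASSUMED_GENOME_SIZE_BP = 3_100_000_000


def read_pairs_for_coverage(coverage, read_length_bp,
                            genome_size_bp=ASSUMED_GENOME_SIZE_BP):
    """Read pairs needed so that sequenced bases / genome size equals `coverage`."""
    bases_needed = coverage * genome_size_bp
    bases_per_pair = 2 * read_length_bp  # paired-end: two reads per cluster
    return math.ceil(bases_needed / bases_per_pair)


pairs = read_pairs_for_coverage(coverage=90, read_length_bp=150)
print(f"{pairs:,} read pairs for 90x at 2 x 150 bp")
```

Under these assumptions a single 90x genome needs on the order of one billion read pairs, which is why flow cell capacity directly determines the weekly throughput.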

Library quantification and normalisation

The quality of sequencing and a homogeneous read distribution largely depend on exact library quantification. Multiple methods are available for DNA library quantification, of which quantitative real-time PCR (qPCR) and fluorometry (Qubit) are the most frequently used. More recent studies also suggest using droplet digital PCR (ddPCR) for quantification [Robin et al. 2016]. Currently no unambiguous gold standard exists, but qPCR is the method recommended by Illumina and, hence, the KAPA Library Quantification Kit (Roche) was used to quantify the WGS libraries. However, in order to get the maximum yield and to make optimal use of the flow cell capacity while guaranteeing equal coverage for all samples, another normalization step was added. In a first run, the samples were sequenced targeting 45x. Based on the read distribution obtained, the sample concentrations were adjusted and a second run was started to top off the coverage of each sample to 90x. The strategy was especially useful for S4 flow cells, increasing the number of 90x genomes in two runs from 16 (source: Illumina specifications) to 18-19. This strategy, including the quantification itself, might soon be replaced by performing quality control (QC) using the iSeq 100 System. The recently published white paper from Illumina demonstrates that the iSeq 100 System enables efficient rebalancing of DNA libraries for sequencing runs on the NovaSeq 6000 system. Availability of specific QC kits would make the iSeq 100 System a noteworthy QC solution that could save time and money.
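The two-run top-off strategy can be illustrated with a small calculation. This is only a sketch of the idea, not the procedure actually used in the project. It assumes that all libraries were pooled in equal nominal amounts in the first run and that the number of reads a library yields scales linearly with the amount pooled. The function name and the example coverages are hypothetical.

```python
def top_off_fractions(first_run_coverage, target=90.0):
    """Relative pooling amounts for the second run.

    Assumes equal nominal input per library in the first run, so the observed
    coverage reflects how efficiently each library produces reads.
    """
    weights = {}
    for sample, observed in first_run_coverage.items():
        if observed <= 0:
            raise ValueError(f"no usable coverage for sample {sample!r}")
        remaining = max(target - observed, 0.0)  # coverage still missing
        # A library that under-performed needs proportionally more input.
        weights[sample] = remaining / observed
    total = sum(weights.values())
    if total == 0:
        return {sample: 0.0 for sample in weights}
    return {sample: weight / total for sample, weight in weights.items()}


# Hypothetical first-run coverages for three libraries targeted at 45x.
first_run = {"A": 45.0, "B": 40.0, "C": 50.0}
for sample, fraction in top_off_fractions(first_run).items():
    print(f"{sample}: {fraction:.1%} of the second-run pool")
```

In this example, library B fell short in the first run and therefore receives the largest share of the second pool, while library C receives the smallest.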


During sequencing, the data were transferred directly to the Amazon Web Services (AWS) cloud in Frankfurt and pre-processed in Illumina’s BaseSpace Sequence Hub. The Illumina tumor/unmatched normal workflow was used for variant calling. Here, a mixture of genomic DNA from multiple anonymous donors was used as normal controls. In total, four normal controls were used: one control per gender and sequencing platform. For each sequencing platform, two patients were sequenced twice with comparable coverage (Table 1) to estimate the reproducibility on the same platform. In addition, two patients were sequenced on both systems to estimate the reproducibility between the two platforms (Table 1). However, the normal controls used for the different platforms differed somewhat more in coverage (~40x), which might affect the variant calling.

Table 1: Sample overview

Patient   Platform   Coverage (x)
3         HiSeq X    127.18
3         HiSeq X    127.58
4         HiSeq X    121.65
4         HiSeq X    126.35
5         HiSeq X    140.94
6         HiSeq X    125.32


In general, the number of called variants depended on the coverage of the sample on both platforms. However, slightly more variants were called for the HiSeq X samples than for the NovaSeq samples. For runs on the same instrument, the reproducibility of ‘PASS’ variant calls was high, with an overlap between 92% and 95.5%. The overlap of ‘PASS’ variants between the two platforms, on the other hand, was lower, at 67.2% and 83.6% for the two patients sequenced on both systems. This might partially be attributed to the difference in coverage of the platform-specific normal samples. Even though the overall number of variants is lower for the samples produced on the NovaSeq system, the number of unique ‘PASS’ variants is higher than for the HiSeq X samples. Hence, the data quality of the NovaSeq samples seems to be slightly better than that of the HiSeq X samples.
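A comparison of this kind can be sketched in a few lines. The text does not state how the percentage overlap was computed, so the sketch below assumes it is the number of shared ‘PASS’ variants divided by the number of ‘PASS’ variants found in either run. It relies only on the standard VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER). The function names and the example records are hypothetical.

```python
def pass_variant_keys(vcf_lines):
    """Collect (chrom, pos, ref, alt) for all records whose FILTER is PASS."""
    keys = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, meta lines and the header line
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt, filt = (fields[0], fields[1], fields[3],
                                      fields[4], fields[6])
        if filt == "PASS":
            keys.add((chrom, int(pos), ref, alt))
    return keys


def percent_overlap(calls_a, calls_b):
    """Shared calls as a percentage of all calls seen in either set (assumed definition)."""
    union = calls_a | calls_b
    if not union:
        return 0.0
    return 100.0 * len(calls_a & calls_b) / len(union)


# Hypothetical records from two runs of the same sample.
run_1 = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tT\t.\tPASS\t.",
    "chr1\t200\t.\tG\tC\t.\tLowQual\t.",
    "chr2\t300\t.\tC\tG\t.\tPASS\t.",
]
run_2 = [
    "chr1\t100\t.\tA\tT\t.\tPASS\t.",
    "chr2\t300\t.\tC\tG\t.\tPASS\t.",
    "chr3\t400\t.\tT\tA\t.\tPASS\t.",
]
overlap = percent_overlap(pass_variant_keys(run_1), pass_variant_keys(run_2))
print(f"PASS overlap: {overlap:.1f}%")
```

Other definitions, such as the shared fraction relative to only one of the two call sets, would give different percentages for the same data, so the definition should be fixed before comparing platforms.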


Robin, Jérôme D., et al. "Comparison of DNA quantification methods for next generation sequencing." Scientific Reports 6 (2016): 24067.