Reducing the complexity of OMICS data analysis

PhD Thesis – Beat Wolf: Reducing the complexity of OMICS data analysis Development of bio-informatic tools to make genetic diagnostics both faster and easier.

DNA sequence analysis is a complex and time consuming process. While the amount of data produced that needs to be analysed rapidly increases, neither the available computational resources nor human resources grow fast enough to keep up with the analysis. The goal of this thesis is to approach both of those bottlenecks in DNA and RNA analysis in the context of diagnostics.

The basis of this thesis is the GensearchNGS project which was developed as a prototype during the author’s Master project. Large parts of the the prototype where rewritten and extended depending on the needs of the diagnostics laboratories, always with the goal to lower the technical and computational complexity. This is achieved through an intuitive user interface, hiding as much of the complexity from the user as possible, to increase the amount of people able to perform the analysis.

The other approach is to create more efficient algorithms, lowering the time to do the analysis, as well as lowering the infrastructure requirements. Parts of this effort have been released in the GNATY project, which is a collection of free tools to perform common bio-informatics tasks in a more efficient manner.

When better algorithms and optimized code reach their limits on what is possible to speed up the data analysis, we turn to distributed computing. This is achieved among other techniques with the POP-Java programming language, which has been vastly improved and extended over the course of this thesis.

The thesis is a collaboration between the institute iCoSys at the HES-SO and the Departement of bioinformatics at the University of Würzburg.

Publications

M. Kunz, B. Wolf, H. Schulze, D. Atlan, T. Walles, H. Walles, and T. Dandekar, “Non-coding RNAs in lung cancer: Contribution of bioinformatics analysis to the development of non-invasive diagnostic tools: Human Genetics and Genomics,” Human Genetics and Genomics, 2016.
[Bibtex]

@article{kunz:2606,
author = "Meik Kunz and Beat Wolf and Harald Schulze and David Atlan
and Thorsten Walles and Heike Walles and Thomas Dandekar",
title = "Non-coding RNAs in lung cancer: Contribution of
bioinformatics analysis to the development of non-invasive
diagnostic tools: Human Genetics and Genomics",
year = "2016",
journal = "Human Genetics and Genomics",
abstract = "Lung cancer is currently the leading cause of cancer
related mortality due to late diagnosis and limited
treatment intervention. Non-coding RNAs are not translated
into proteins and have emerged as fundamental regulators of
gene expression. Recent studies reported that microRNAs and
long non-coding RNAs are involved in lung cancer development
and progression. Moreover, they appear as new promising
non-invasive biomarkers for early lung cancer diagnosis.
Here, we highlight their potential as biomarker in lung
cancer and present how bioinformatics can contribute to the
development of non-invasive diagnostic tools. For this, we
discuss several bioinformatics algorithms and software tools
for a comprehensive understanding and functional
characterization of microRNAs and long non-coding RNAs.",
keywords = "lung cancer",
keywords = "non-invasive biomarkers",
keywords = "miRNAs",
keywords = "lncRNAs",
keywords = "early diagnosis",
keywords = "bioinformatics",
keywords = "algorithm",
}

B. Wolf, P. Kuonen, and T. Dandekar, “GNATY: Optimized NGS variant calling and coverage analysis : IWBBIO 2016. International Work-Conference on Bioinformatics and Biomedical Engineering,” IWBBIO 2016, 2016.
[Bibtex]

@article{Wolf:2607,
author = "Beat Wolf and Pierre Kuonen and Thomas Dandekar",
title = "GNATY: Optimized NGS variant calling and coverage analysis
: IWBBIO 2016. International Work-Conference on
Bioinformatics and Biomedical Engineering",
month = "avr",
year = "2016",
journal = "IWBBIO 2016",
abstract = "Next generation sequencing produces an ever increasing
amount of data, requiring increasingly fast computing
infrastructures to keep up. We present GNATY, a collection
of tools for NGS data analysis, aimed at optimizing parts of
the sequence analysis process to reduce the hardware
requirements. The tools are developed with efficiency in
mind, using multithreading and other techniques to speed up
the analysis. The architecture has been verified by
implementing a variant caller based on the Varscan 2 variant
calling model, achieving a speedup of nearly 18 times.
Additionally, the flexibility of the algorithm is also
demonstrated by applying it to coverage analysis. Compared
to BEDtools 2 the same analysis results were found but in
only half the time by GNATY. The speed increase allows for a
faster data analysis and more flexibility to analyse the
same sample using multiple settings. The software is freely
available for non-commercial usage at
http://gnaty.phenosystems.com/",
keywords = "Next generation sequencing",
keywords = "Variant calling",
keywords = "Algorithmics",
}

E. Bach, B. Wolf, J. Oldenburg, C. Müller, and S. Rost, “Identification of deep intronic variants in 15 haemophilia A patients by next generation sequencing of the whole factor VIII gene.: Thrombosis and Haemostasis,” Thrombosis and Haemostasis, p. 10, 2015.
[Bibtex]

@article{Elisa:2175,
author = "Bach,Elisa and Wolf, Beat and Oldenburg, Johannes and Müller, Clemens and Rost, Simone",
title = "Identification of deep intronic variants in 15 haemophilia
A patients by next generation sequencing of the whole factor
VIII gene.: Thrombosis and Haemostasis",
year = "2015",
journal = "Thrombosis and Haemostasis",
pages = "10",
issn = "0340-6245",
abstract = "Current screening methods for factor VIII gene (F8)
mutations can reveal the causative alteration in the vast
majority of haemophilia A patients. Yet, standard diagnostic
methods fail in about 2% of cases. This study aimed at
analysing the entire intronic sequences of the F8 gene in 15
haemophilia A patients by next generation sequencing. All
patients had a mild to moderate phenotype and no mutation in
the coding sequence and splice sites of the F8 gene could be
diagnosed so far. Next generation sequencing data revealed
23 deep intronic candidate variants in several F8 introns,
including six recurrent variants and three variants that
have been described before. One patient additionally showed
a deletion of 9.2 kb in intron 1, mediated by Alu-type
repeats. Several bioinformatic tools were used to score the
variants in comparison to known pathogenic F8 mutations in
order to predict their deleteriousness. Pedigree analyses
showed a correct segregation pattern for three of the
presumptive mutations. In each of the 15 patients analysed,
at least one deep intronic variant in the F8 gene was
identified and predicted to alter F8 mRNA splicing. Reduced
F8 mRNA levels and/or stability would be well compatible
with the patients' mild to moderate haemophilia A
phenotypes. The next generation sequencing approach used
proved an efficient method to screen the complete F8 gene
and could be applied as a one-stop sequencing method for
molecular diagnostics of haemophilia A.",
keywords = "factor VIII",
keywords = "haemophilia A",
keywords = "next generation sequencing",
keywords = "Alternative splice sites",
keywords = "deep intronic variant",
}

B. Wolf, P. Kuonen, and T. Dandekar, “GNATY: A tools library for faster variant calling and coverage analysis,” German Conference on Bioinformatics, 2015.
[Bibtex]

@article{Wolf:2174,
Abstract = {Following the speed increases in next generation sequencing over recent years, the proportion of time spent in sequence analysis compared to sequencing has increasingly shifted towards sequence analysis. While certain analysis steps such  as sequence alignment were able to benefit from various speed increases, others, equally important steps like variant calling or coverage analysis, did not receive the same improvements. Analysing NGS data remains a complicated and time consuming process, requiring a substantial amount of computing power. Most current approaches to address the increasing data quantity rely on the usage of more powerful hardware or offload calculations to the cloud. In this poster we show that by using modern software development techniques such as stream processing, those additional analysis steps can be sped up without changing the analysis  results. Developing more efficient implementations of existing algorithms makes it possible to process larger  datasets on existing infrastructure, without changing the analysis results. This not only reduces the overall cost of data analysis, but also gives researches more flexibility  when exploring different settings for the data analysis. We present the application GNATY, a stand-alone version of NGS data analysis tools used in GensearchNGS [WKDD15] developed by Phenosystems SA. The goal during the development of the GNATY tools was not to create new methods with different results to existing approaches, but explore the possibilities of improving the efficiency of existing approaches. A modular architecture has been developed to  create efficient sequence alignment analysis tools, using stream processing techniques which allow for multithreading and reusable data analysis blocks. The modular architecture uses a stream processing based workflow, efficiently  splitting data access and data processing analysis steps, resulting in a more efficient use of the available computing resources. The architecture has been verified by implementing a variant caller based on the Varscan 2 [KZL+12] variant calling model, achieving a speedup of nearly 18 times. The results of the variant calling in GNATY are identical to Varscan 2, avoiding the issue of adding yet another variant calling model to the existing ones. To further demonstrate the flexibility and efficiency of the approach, the algorithm is also applied to coverage analysis. Compared to BEDtools 2 [QH10], GNATY was twice as fast to perform coverage analysis, while producing the exact same results. Through the example of 2 existing next generation sequencing data analysis algorithms which are reimplemented with an efficient stream based modular architecture, we show the performance potential in existing  data analysis tools. We hope that our work will lead to more efficient algorithms in bioinformatics in general, lessening the hardware requirements to cope with the ever increasing amounts of data to be analysed. The developed GNATY software is freely available for non-commercial usage at http://gnaty.phenosystems.com/.},
Author = {Beat Wolf and Pierre Kuonen and Thomas Dandekar},
Journal = {German Conference on Bioinformatics},
Keywords = {Genetics},
Month = {sep},
Pdf = {https://peerj.com/preprints/1350.pdf#page=16},
Title = {GNATY: A tools library for faster variant calling and  coverage analysis},
Year = {2015}}

B. Wolf, L. Monney, and P. Kuonen, “FriendComputing: Organic application centric distributed computing,” Nesus 2015 workshop, 2015.
[Bibtex]

@article{Wolf:2173,
Abstract = {Building Ultrascale computer systems is a hard problem, not yet solved and fully explored. Combining the computing resources of multiple organizations, often in different administrative domains with heterogeneous hardware and diverse demands on the system, requires new tools and frameworks to be put in place. During previous work we developed POP-Java, a Java programming language extension that allows to easily develop distributed applications in a heterogeneous environment. We now present an extension to the POP-Java language, that allows to create application centered networks in which any member can benefit from the computing power and storage capacity of its members. An accounting system is integrated, allowing the different members of the network to bill the usage of their resources to the other members, if so desired. The system is expanded through a similar process as seen in social networks, making it possible to use the resources of friend and friends of friends. Parts of the proposed system has been implemented as a prototype inside the POP-Java programming language.},
Author = {Beat Wolf and Loic Monney and Pierre Kuonen},
Journal = {Nesus 2015 workshop},
Keywords = {Distributed computing},
Month = {sep},
Pdf = {http://e-archivo.uc3m.es/bitstream/handle/10016/22003/friendcomputing_NESUS_2015.pdf},
Title = {FriendComputing: Organic application centric distributed computing},
Year = {2015}}

B. Wolf, P. Kuonen, and T. Dandekar, “Multilevel parallelism in sequence alignment using a streaming approach,” Nesus 2015 workshop, 2015.
[Bibtex]

@article{Wolf:2172,
Abstract = {Ultrascale computing and bioinformatics are two rapidly growing fields with a big impact right now and even more so in the future. The introduction of next generation sequencing pushes current bioinformatics tools and workflows to their limits in terms of performance. This forces the tools to become increasingly performant to keep up with the growing speed at which sequencing data is created. Ultrascale computing can greatly benefit bioinformatics in the challenges it faces today, especially in terms of scalability, data management and reliability. But before this is possible, the algorithms and software used in the field of bioinformatics need to be prepared to be used in a heterogeneous distributed environment. For this paper we choose to look at sequence alignment, which has been an active topic of research to speed up next generation sequence analysis, as it is ideally suited for parallel processing. We present a multilevel stream based parallel architecture to transparently distribute sequence alignment over multiple cores of the same machine, multiple machines and cloud resources. The same concepts are used to achieve multithreaded and distributed parallelism, making the architecture simple to extend and adapt to new situations. A prototype of the architecture has been implemented using an existing commercial sequence aligner. We demonstrate the flexibility of the implementation by running it on different configurations, combining local and cloud computing resources.},
Author = {Beat Wolf and Pierre Kuonen and Thomas Dandekar},
Journal = {Nesus 2015 workshop},
Keywords = {Genetics},
Month = {sep},
Pdf = {http://e-archivo.uc3m.es/handle/10016/22004},
Title = {Multilevel parallelism in sequence alignment using a streaming approach},
Url = {http://e-archivo.uc3m.es/bitstream/handle/10016/22004/multilevel_NESUS_2015.pdf},
Year = {2015},
Pdf = {http://e-archivo.uc3m.es/bitstream/handle/10016/22004/multilevel_NESUS_2015.pdf}}

B. Wolf, P. Kuonen, T. Dandekar, and D. Atlan, “DNAseq Workflow in a Diagnostic Context and an Example of a User Friendly Implementation,” BioMed Research International, vol. 2015, p. 11, 2015.
[Bibtex]

@article{Wolf:2171,
Abstract = {Over recent years next generation sequencing (NGS) technologies evolved from costly tools used by very few, to a much more accessible and economically viable technology. Through this recently gained popularity, its use-cases expanded from research environments into clinical settings. But the technical know-how and infrastructure required to analyze the data remain an obstacle for a wider adoption of this technology, especially in smaller laboratories. We present GensearchNGS, a commercial DNAseq software suite distributed by Phenosystems SA. The focus of GensearchNGS is the optimal usage of already existing infrastructure, while keeping its use simple. This is achieved through the integration of existing tools in a comprehensive software environment, as well as custom algorithms developed with the restrictions of limited infrastructures in mind. This includes the possibility to connect multiple computers to speed up computing intensive parts of the analysis such as sequence alignments. We present a typical DNAseq workflow for NGS data analysis and the approach GensearchNGS takes to implement it. The presented workflow goes from raw data quality control to the final variant report. This includes features such as gene panels and the integration of online databases, like Ensembl for annotations or Cafe Variome for variant sharing.},
Author = {Beat Wolf and Pierre Kuonen and Thomas Dandekar and David Atlan},
Journal = {BioMed Research International},
Keywords = {Genetics},
Month = {may},
Pages = {11},
Pdf = {http://e-archivo.uc3m.es/bitstream/handle/10016/22004/multilevel_NESUS_2015.pdf},
Title = {DNAseq Workflow in a Diagnostic Context and an Example of a User Friendly Implementation},
Volume = {2015},
Year = {2015}}

B. Wolf, P. Kuonen, T. Dandekar, and D. Atlan, “GensearchNGS: Interactive variant analysis,” 13th International Symposium on Mutation in the Genome: detection, genome sequencing & interpretation, 2015.
[Bibtex]

@article{Wolf:2090,
Abstract = {NGS data analysis is increasingly popular in the diagnostics field thanks to advances in sequencing technologies which improved the speed, quantity and quality of the produced data. Due to those improvements, the analysis of the data requires an increasing amount of technical knowledge and processing power. Several software tools exist to handle these technical challenges involved in NGS data analysis. We present the latest improvements in one of those software-tools, GensearchNGS 1.6, a NGS data analysis software allowing users to go from raw NGS data to variant reports. We focus on the improvements made in terms of variant calling, annotation and filtering. The variant calling algorithm has been completely rewritten, based on the variant calling model used in Varscan 2, greatly improving its speed (over 10 times faster than Varscan 2, over 5 times faster than GATK) and accuracy, while reducing memory requirements. For the subsequent annotation of the called variants, various new datasources have been integrated, such as Human Phenotype Ontology and the clinical predictions from Ensembl, which give the user more information about the clinical relevance of the called variants. An initial prototype of the integration of interactome data from different sources, such as CCSB or BioGRID, is also presented, further increasing the available information for variant effect prediction. The addition of annotation data has been accompanied by various optimizations, keeping memory requirements and analysis times stable. The interactive variant filtering, which updates a variant list presented to the user while he changes the filters, has been further optimized, making it possible to filter variants interactively even on computers with limited processing power and memory. Similar improvements have also been made to the visualizer, allowing for a faster visualization requiring fewer resources, while integrating more data, such as the previously mentioned databases.},
Author = {Beat Wolf and Pierre Kuonen and Thomas Dandekar and David Atlan},
Journal = {13th International Symposium on Mutation in the Genome: detection, genome sequencing & interpretation},
Keywords = {Diagnostique genetique},
Month = {avr},
Pdf = {http://www.hindawi.com/journals/bmri/2015/403497/},
Title = {GensearchNGS: Interactive variant analysis},
Year = {2015}}

B. Wolf, P. Kuonen, T. Dandekar, and D. Atlan, “Speeding up NGS analysis through local and remote computing resources,” The EUROPEAN HUMAN GENETICS CONFERENCE, 2015.
[Bibtex]

@article{eshg2015,
Abstract = {The explosion of NGS data which requires increasingly fast computers to keep up with the analysis pushed smaller laboratories to their limits. We previously presented a way for GensearchNGS users to distribute sequence alignment over multiple computers. This possibility has now been expanded to combine multiple computers in the same network with cloud computing resources. For our prototype, the user can by request add Amazon AWS EC2 cloud instances to the alignment process. The cloud resources are dynamically created and destroyed on demand, transparently to the user. It is also possible to combine local alignment on the computer starting the alignment, distributed alignment on multiple computers in the same network and cloud computing in any possible configuration. Even completely offloading the alignment is now possible. This flexibility allows especially smaller laboratories to adapt the software configuration to their needs.},
Author = {Beat Wolf and Pierre Kuonen and Thomas Dandekar and David Atlan},
Journal = {The EUROPEAN HUMAN GENETICS CONFERENCE},
Keywords = {Diagnostique g{\'e}n{\'e}tique},
Title = {Speeding up NGS analysis through local and remote computing resources},
Year = {2015}}

B. Wolf, P. Kuonen, and T. Dandekar, “POP-Java : Parallélisme et distribution orienté objet: Compas2014 (Conférence d’informatique en Parallélisme, Architecture et Système) . Compas2014 (Conférence d’informatique en Parallélisme, Architecture et Système),” Compas2014 (Conférence d’informatique en Parallélisme, Architecture et Système), 2014.
[Bibtex]

@article{Beat:1856,
author = "Wolf, Beat and Kuonen, Pierre and Dandekar, Thomas",
title = "POP-Java : Parallélisme et distribution orienté objet:
Compas2014 (Conférence d'informatique en Parallélisme,
Architecture et Système) . Compas2014 (Conférence
d'informatique en Parallélisme, Architecture et Système) ",
month = "avr",
year = "2014",
journal = "Compas2014 (Conférence d'informatique en Parallélisme,
Architecture et Système) ",
}

B. Wolf and P. Kuonen, “A novel approach for heuristic pairwise DNA sequence alignment: The 2013 International Conference on Bioinformatics & Computational Biology (BIOCOMP’13),” The 2013 International Conference on Bioinformatics & Computational Biology (BIOCOMP’13), 2013.
[Bibtex]

@article{Pierre:1604,
author = "Wolf, Beat and Kuonen, Pierre",
title = "A novel approach for heuristic pairwise DNA sequence
alignment: The 2013 International Conference on
Bioinformatics & Computational Biology (BIOCOMP'13)",
month = "jul",
year = "2013",
journal = "The 2013 International Conference on Bioinformatics &
Computational Biology (BIOCOMP'13)",
issn = " ",
}