GenoML: Genomic variant interpretation with machine learning
Improve DNA sequence analysis by using machine learning to predict sequencing artefacts. In this feasibility study, we explored whether a machine learning model can be trained to detect artefacts originating from either sequencing or alignment errors. We compared our approach with existing methods such as DeepVariant and the GATK CNN and achieved similar results at a much higher analysis speed.
For this project, we built an automated pipeline that generates artificial next-generation sequencing (NGS) data with known errors using NEAT-genreads and aligns the reads against the human reference genome. This allowed us to train a machine learning model on specific data and to compare its predictions against the known errors that were injected. Finally, we evaluated our method on the well-known HG001 (NA12878) dataset, which allowed us to verify the transferability between synthetic and real data.
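The comparison against injected errors can be illustrated with a minimal sketch. It is not the project's actual code: the `Site` tuple, the helper names, and the toy data are all hypothetical, and it assumes that a candidate call counts as an artefact whenever it is absent from the simulator's truth set.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Site(NamedTuple):
    """A variant site; hypothetical representation for illustration."""
    chrom: str
    pos: int
    ref: str
    alt: str


def label_candidates(candidates: Iterable[Site], truth: Iterable[Site]) -> List[bool]:
    """Return True for each candidate that is an artefact.

    Assumption: a candidate is an artefact if it is not in the truth set
    of variants known from the simulation.
    """
    truth_set = set(truth)
    return [site not in truth_set for site in candidates]


def precision_recall(predicted: List[bool], actual: List[bool]) -> Tuple[float, float]:
    """Precision and recall of artefact predictions against known labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(a and not p for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (invented): two true variants and four candidate calls.
truth = [Site("chr1", 100, "A", "G"), Site("chr1", 250, "C", "T")]
candidates = [
    Site("chr1", 100, "A", "G"),  # true variant
    Site("chr1", 180, "G", "A"),  # artefact
    Site("chr1", 250, "C", "T"),  # true variant
    Site("chr1", 300, "T", "C"),  # artefact
]
labels = label_candidates(candidates, truth)

# Hypothetical model output: flags only the first artefact.
predicted = [False, True, False, False]
precision, recall = precision_recall(predicted, labels)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because the labels come from the simulation itself, no manual curation is needed for the synthetic data. The same metrics could then be computed on a real sample with a trusted truth set to check transferability.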
The project was done as part of an Innocheque with Phenosystems SA.