SeqScreen: Accurate and Sensitive Functional Screening of Pathogenic Sequences

Abstract

Modern benchtop DNA synthesis techniques and growing concern about emerging pathogens have elevated the importance of screening oligonucleotides for pathogens of concern. However, accurate and sensitive characterization of oligonucleotides remains an open challenge for many current techniques and ontology-based tools. To address this gap, we have developed a novel software tool, SeqScreen, that can accurately and sensitively characterize short DNA sequences using a set of curated Functions of Sequences of Concern (FunSoCs): novel functional labels, specific to microbial pathogenesis, that describe the pathogenic potential of individual proteins. SeqScreen uses ensemble machine learning models, encompassing multi-stage Neural Networks and Support Vector Classifiers, to label query sequences with FunSoCs with high accuracy; the labeling is framed as an imbalanced multi-class, multi-label classification task. In summary, SeqScreen represents a first step towards a novel paradigm of functionally informed pathogen characterization from genomic and metagenomic datasets. SeqScreen is open-source and freely available for download at: www.gitlab.com/treangenlab/seqscreen
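As a rough illustration of what "imbalanced, multi-label classification with an ensemble" means, the toy sketch below combines label sets from several models by voting and computes inverse-frequency label weights. It is not SeqScreen's implementation: the function names, the voting rule, the weighting formula, and the placeholder labels `funsoc_A` and `funsoc_B` are all assumptions made for this example.

```python
from collections import Counter


def label_weights(training_label_sets):
    """Inverse-frequency weight per label, so rare labels count for more.

    Assumed formula: n_samples / (n_labels * label_count).
    """
    counts = Counter(label for labels in training_label_sets for label in labels)
    n_samples = len(training_label_sets)
    return {label: n_samples / (len(counts) * c) for label, c in counts.items()}


def ensemble_vote(per_model_labels, min_votes):
    """Multi-label vote: keep every label assigned by at least `min_votes` models."""
    votes = Counter(label for labels in per_model_labels for label in set(labels))
    return {label for label, v in votes.items() if v >= min_votes}


# Hypothetical training annotations: one sequence can carry zero, one, or many labels.
training = [{"funsoc_A"}, {"funsoc_A"}, {"funsoc_A", "funsoc_B"}, set()]
weights = label_weights(training)

# Hypothetical outputs of three separate models for a single query sequence.
predictions = [{"funsoc_A", "funsoc_B"}, {"funsoc_A"}, {"funsoc_B"}]
result = ensemble_vote(predictions, min_votes=2)
```

Here the rarer label `funsoc_B` receives a larger weight than `funsoc_A`, and a query keeps both labels because each was assigned by two of the three models.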

Date
Oct 26, 2020 1:00 PM — 1:15 PM
Location
Rice University
6100 Main St, Houston, TX 77005
Check out the SeqScreen preprint here
Dr. Advait Balaji
PhD student from 2018 through 2023 (currently Analytics Engineer at Oxy)

Advait (5th year PhD student) obtained a dual degree, a B.E. in Computer Science and an M.S. in Biological Sciences, from BITS Pilani in India. During his undergraduate degree, he received the Khorana Scholarship (2016) from the Indo-US Science and Technology Forum, as well as a thesis fellowship (2017-18) to work at the Icahn School of Medicine at Mount Sinai, NY. At Mount Sinai, he worked on creating a subcellular process-based ontology that predicts whole-cell function using Natural Language Processing. His research interests lie at the intersection of genomic data science and the design of efficient algorithms for analyzing genomic data.
