Aftab Hussain
University of Houston

An ML Framework for Gene Subsequence Analysis

June 2024


Skills used: Python, data manipulation, information retrieval, scikit-learn, NumPy, SVM


This exploratory project develops an approach for investigating the relationship between gene subsequences and gene families. We apply it to a curated synthetic dataset of 50,000 samples, each pairing a gene sequence with its corresponding family. The goal is a machine learning framework that tests whether certain features of gene subsequences are related to the families the genes belong to.

Specifically, we construct the feature Subs_k_N: the number of subsequences of a gene sequence that contain exactly k nucleotides of type N. We compute this feature for each nucleotide type with a hash-map-based prefix-sum approach that runs in O(n) time, where n is the number of characters in the given sample.
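
The following is a minimal sketch of how such a count can be computed, not the project's actual code. It assumes that "subsequence" means a contiguous run of characters (a substring), since that is the case a prefix-sum approach handles in a single pass. The names subs_k_n and feature_vector, the ACGT alphabet, and the max_k parameter are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def subs_k_n(sequence: str, nucleotide: str, k: int) -> int:
    """Count contiguous substrings of `sequence` that contain exactly
    `k` occurrences of `nucleotide` (assumed reading of Subs_k_N)."""
    # seen[c] = number of prefixes processed so far that contain c copies
    # of `nucleotide`; the empty prefix contains zero copies.
    seen = defaultdict(int)
    seen[0] = 1
    running = 0  # copies of `nucleotide` in the current prefix
    total = 0
    for ch in sequence:
        if ch == nucleotide:
            running += 1
        # A substring ending here has exactly k copies when it starts
        # right after an earlier prefix that had (running - k) copies.
        total += seen.get(running - k, 0)
        seen[running] += 1
    return total


def feature_vector(sequence: str, max_k: int, alphabet: str = "ACGT") -> dict:
    """Hypothetical helper: Subs_k_N for every nucleotide and 0 <= k <= max_k."""
    return {
        (n, k): subs_k_n(sequence, n, k)
        for n in alphabet
        for k in range(max_k + 1)
    }


print(subs_k_n("AAB", "A", 1))  # 3: "A", "A", "AB"
```

Each character is visited once and each hash-map operation takes expected constant time, so one call is O(n) for a sequence of length n. Building a full feature vector this way repeats the pass once per (nucleotide, k) pair.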

The framework applies several machine learning algorithms to these features to examine how gene subsequences relate to their families.


Source code: GitHub

Image by rawpixel.com on Freepik