Gene Ontology GAN (GOGAN): a novel architecture for protein function prediction
Mansoor, Musadaq ; Nauman, Mohammad ; Rehman, Hafeez Ur ; Benso, Alfredo
Date: 2022-08-01
Abstract
Precise annotation of protein functions is one of the most important requirements for a deep interpretation of molecular biology. An overwhelming majority of proteins, across species, do not have sufficient supplementary information available, which causes them to remain uncharacterized. In contrast, all known proteins have one key piece of information available: their amino acid sequence. Therefore, to make algorithms applicable to proteins from different species, researchers are motivated to develop computational techniques that characterize proteins using their amino acid sequence. However, computational techniques such as deep learning algorithms require a huge amount of labeled data to produce good results. Labeling data is time- and resource-consuming, which makes labeled data scarce. Using this universally available sequence information to address the aforementioned issues of uncharacterized proteins and traditional deep learning algorithms, we propose a model called GOGAN that operates on the amino acid sequence of a protein to predict its functions. The proposed GOGAN model does not require any handcrafted features; instead, it automatically extracts all the required information from the input sequence. The GOGAN model extracts features from massive unlabeled protein datasets. The term “unlabeled data” refers to data that have not been assigned labels identifying their characteristics or properties. The features extracted by the GOGAN model can also be used in other applications, such as gene variation analysis, gene expression analysis and gene regulation network detection. The proposed model is benchmarked on the Homo sapiens protein dataset extracted from the UniProt database. Experimental results show clear improvements in different evaluation metrics when compared with other methods. Overall, GOGAN achieves an F1 score of 72.1% with a Hamming loss of 9.5%, using only the amino acid sequences of proteins.
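The two reported metrics, F1 score and Hamming loss, are standard for multi-label problems such as assigning several Gene Ontology terms to one protein. The sketch below shows how they could be computed on a toy example. It is illustrative only: the abstract does not state which F1 averaging was used, so micro-averaging is an assumption here, and the function names and the toy label matrices are hypothetical, not taken from the paper.

```python
from typing import List

# Each row is one protein, each column one GO term (1 = term assigned).
LabelMatrix = List[List[int]]


def hamming_loss(y_true: LabelMatrix, y_pred: LabelMatrix) -> float:
    """Fraction of (protein, term) label slots that are predicted wrongly."""
    total = 0
    wrong = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += 1
            wrong += int(t != p)
    return wrong / total


def micro_f1(y_true: LabelMatrix, y_pred: LabelMatrix) -> float:
    """Micro-averaged F1: counts are pooled over all proteins and terms.

    Assumption: the averaging scheme is not given in the abstract.
    """
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy data: 3 proteins, 4 GO terms.
y_true = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]]
y_pred = [[1, 0, 0, 0], [0, 1, 0, 1], [1, 1, 0, 1]]

print(f"Hamming loss: {hamming_loss(y_true, y_pred):.3f}")
print(f"Micro F1:     {micro_f1(y_true, y_pred):.3f}")
```

A lower Hamming loss and a higher F1 score both indicate better predictions; the two are reported together because Hamming loss also rewards correctly predicted absent terms, while F1 focuses on the assigned ones.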