TY - JOUR
T1 - Optimal features selection in the high dimensional data based on robust technique
T2 - Application to different health database
AU - Hussain, Ibrar
AU - Qureshi, Moiz
AU - Ismail, Muhammad
AU - Iftikhar, Hasnain
AU - Zywiołek, Justyna
AU - López-Gonzales, Javier Linkolk
N1 - Publisher Copyright: © 2024 The Authors
PY - 2024/9/15
Y1 - 2024/9/15
N2 - Bioinformatics and gene expression analysis face major hurdles when dealing with high-dimensional data, where the number of variables or genes far exceeds the number of samples. These difficulties are exacerbated, particularly in microarray data processing, by redundant genes that do not contribute significantly to the response variable. To address this issue, gene selection offers a feasible way to identify the most important genes and thereby reduce the generalization error of classification algorithms. This paper introduces a new hybrid approach for gene selection that combines the Signal-to-Noise Ratio (SNR) score with the robust Mood median test. The Mood median test reduces the impact of outliers in non-normal or skewed data and can identify genes with significant differences across groups. The SNR score measures a gene's relevance for classification by comparing the gap between class means with the within-class variability. By integrating the two, the proposed approach aims to find genes that are significant for classification tasks. The major objective of this study is to evaluate the effectiveness of this combined approach in choosing the optimal genes. For each gene, a P-value is obtained from the Mood median test and an SNR score is computed. The Md score is then calculated by dividing each gene's SNR value by its P-value. Genes with a high SNR are considered favorable because of their minimal noise influence and high classification importance. To verify the effectiveness of the selected genes, the study uses two dependable classification techniques: Random Forest and K-Nearest Neighbors (KNN). These algorithms were chosen for their track record on classification tasks. The performance of the selected genes is evaluated using two metrics: error reduction and classification accuracy. These metrics offer an in-depth assessment of how well the selected genes improve classification accuracy and consistency. According to the findings, the proposed hybrid approach outperforms conventional gene selection methods on high-dimensional datasets and yields lower classification error rates. Classification accuracy and error reduction improve considerably when the selected genes are supplied to the Random Forest and KNN classifiers. The outcomes demonstrate that this hybrid technique can be a helpful tool for improving gene selection in bioinformatics.
AB - Bioinformatics and gene expression analysis face major hurdles when dealing with high-dimensional data, where the number of variables or genes far exceeds the number of samples. These difficulties are exacerbated, particularly in microarray data processing, by redundant genes that do not contribute significantly to the response variable. To address this issue, gene selection offers a feasible way to identify the most important genes and thereby reduce the generalization error of classification algorithms. This paper introduces a new hybrid approach for gene selection that combines the Signal-to-Noise Ratio (SNR) score with the robust Mood median test. The Mood median test reduces the impact of outliers in non-normal or skewed data and can identify genes with significant differences across groups. The SNR score measures a gene's relevance for classification by comparing the gap between class means with the within-class variability. By integrating the two, the proposed approach aims to find genes that are significant for classification tasks. The major objective of this study is to evaluate the effectiveness of this combined approach in choosing the optimal genes. For each gene, a P-value is obtained from the Mood median test and an SNR score is computed. The Md score is then calculated by dividing each gene's SNR value by its P-value. Genes with a high SNR are considered favorable because of their minimal noise influence and high classification importance. To verify the effectiveness of the selected genes, the study uses two dependable classification techniques: Random Forest and K-Nearest Neighbors (KNN). These algorithms were chosen for their track record on classification tasks. The performance of the selected genes is evaluated using two metrics: error reduction and classification accuracy. These metrics offer an in-depth assessment of how well the selected genes improve classification accuracy and consistency. According to the findings, the proposed hybrid approach outperforms conventional gene selection methods on high-dimensional datasets and yields lower classification error rates. Classification accuracy and error reduction improve considerably when the selected genes are supplied to the Random Forest and KNN classifiers. The outcomes demonstrate that this hybrid technique can be a helpful tool for improving gene selection in bioinformatics.
KW - High-dimensional data
KW - Hybrid technique
KW - Machine learning models
KW - Mood median test
KW - Optimizing gene selection
KW - Signal-to-noise ratio score
UR - http://www.scopus.com/inward/record.url?scp=85203009281&partnerID=8YFLogxK
U2 - 10.1016/j.heliyon.2024.e37241
DO - 10.1016/j.heliyon.2024.e37241
M3 - Review article
AN - SCOPUS:85203009281
SN - 2405-8440
VL - 10
JO - Heliyon
JF - Heliyon
IS - 17
M1 - e37241
ER -
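
The scoring step summarized in the abstract (an SNR score and a Mood median test P-value per gene, combined as Md = SNR / P) can be sketched as below. This is an illustrative sketch only, not the authors' implementation. The abstract does not give the exact formulas, so the following are assumptions: the SNR is taken as |mean difference| / (sum of class standard deviations), the test is the two-group Mood median test with a Pearson chi-square statistic and no continuity correction, and a small floor on the P-value avoids division by zero. All function names and the synthetic data are hypothetical.

```python
import math
import random
import statistics


def snr_score(a, b):
    """Assumed SNR form: |mean(a) - mean(b)| / (sd(a) + sd(b))."""
    return abs(statistics.mean(a) - statistics.mean(b)) / (
        statistics.stdev(a) + statistics.stdev(b)
    )


def mood_median_p(a, b):
    """P-value of a two-group Mood median test (no continuity correction)."""
    grand = statistics.median(a + b)
    above_a = sum(v > grand for v in a)
    above_b = sum(v > grand for v in b)
    # 2x2 table: rows = above / not above the grand median, columns = groups.
    table = [[above_a, above_b], [len(a) - above_a, len(b) - above_b]]
    n = len(a) + len(b)
    rows = [sum(r) for r in table]
    cols = [len(a), len(b)]
    if 0 in rows:
        return 1.0  # no value lies above the grand median: no evidence
    stat = sum(
        (table[i][j] - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
        for i in range(2)
        for j in range(2)
    )
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2))


def md_score(a, b, floor=1e-300):
    """Md score as described in the abstract: SNR divided by the P-value.

    The floor on the P-value is an added guard, not part of the source.
    """
    return snr_score(a, b) / max(mood_median_p(a, b), floor)


# Synthetic demo: 50 genes, 20 samples per class, only gene 0 is shifted.
random.seed(0)
n_per_class, n_genes = 20, 50
genes = []
for g in range(n_genes):
    shift = 3.0 if g == 0 else 0.0
    a = [random.gauss(0.0, 1.0) for _ in range(n_per_class)]
    b = [random.gauss(shift, 1.0) for _ in range(n_per_class)]
    genes.append((a, b))

scores = [md_score(a, b) for a, b in genes]
# Gene indices ordered from highest to lowest Md score.
ranking = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
```

In a workflow like the one the abstract describes, the top-ranked genes from `ranking` would then be passed to a classifier such as Random Forest or KNN. That step is omitted here.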