Feature Selection Methods: swLDA and RF
Forward feature selection starts by testing each individual feature (i.e., gene) and selecting the one that provides the best classification quality on the training set. It then tests all pairs in which the first feature is the one selected at the previous step, and again keeps the pair that classifies best.
The process continues to triples, quadruples, and so on. Such a greedy strategy is not guaranteed to find the globally optimal subset, but it produces a result in acceptable time: greedy algorithms build the solution piece by piece, making the locally best choice at each step and then extending that partial solution, rather than searching over all possible feature combinations.
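The greedy loop described above can be sketched as follows. This is a minimal illustration, not the server's actual implementation: a nearest-centroid classifier stands in for the real scoring model, and the data matrix and gene values are invented for the example.

```python
# Sketch of greedy forward feature selection.
# The classifier (nearest centroid) and the toy data are illustrative
# assumptions, not the actual pipeline's model or data.
from statistics import mean

def centroid_accuracy(X, y, feats):
    """Training accuracy of a nearest-centroid classifier
    restricted to the chosen feature subset."""
    classes = sorted(set(y))
    centroids = {
        c: [mean(X[i][f] for i in range(len(X)) if y[i] == c) for f in feats]
        for c in classes
    }
    correct = 0
    for row, label in zip(X, y):
        point = [row[f] for f in feats]
        pred = min(classes, key=lambda c: sum(
            (p - q) ** 2 for p, q in zip(point, centroids[c])))
        correct += pred == label
    return correct / len(y)

def greedy_select(X, y, k):
    """At each step, add the single feature that most improves
    classification quality of the current subset; repeat k times."""
    selected, remaining = [], list(range(len(X[0])))
    for _ in range(k):
        best = max(remaining,
                   key=lambda f: centroid_accuracy(X, y, selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy expression matrix (rows = samples, columns = genes):
# gene 0 separates the two classes, genes 1-2 are noise.
X = [[0.1, 5.0, 2.0], [0.2, 4.1, 3.0], [0.3, 5.5, 2.5],
     [0.9, 4.5, 2.2], [1.0, 5.2, 3.1], [1.1, 4.8, 2.8]]
y = [0, 0, 0, 1, 1, 1]
order = greedy_select(X, y, k=2)  # the informative gene is picked first
```

Note that the subset quality is re-evaluated jointly at every step, so a gene that is weak on its own can still be added if it complements the genes already selected.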
Stepwise Linear Discriminant Analysis (swLDA) is used to find a subset of the provided features that optimally separates the classes inherent in the data. In this procedure, a discriminant model is built iteratively: starting from an empty feature set, an LDA classifier is evaluated with each candidate feature, and the feature yielding the highest accuracy is added to the selected set.
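The per-feature scoring inside one swLDA step can be illustrated with a univariate Fisher-type criterion (between-class separation divided by within-class spread), which is the quantity LDA maximizes in one dimension. This is a simplified sketch under that assumption; the gene values below are invented.

```python
# Fisher-type score for ranking single genes, as a stand-in for the
# LDA evaluation inside one swLDA step. Data are illustrative.
from statistics import mean, pvariance

def fisher_score(values, labels):
    """(difference of class means)^2 / (sum of within-class variances)
    for a two-class problem; higher means better separation."""
    a = [v for v, l in zip(values, labels) if l == 0]
    b = [v for v, l in zip(values, labels) if l == 1]
    within = pvariance(a) + pvariance(b)
    return (mean(a) - mean(b)) ** 2 / within if within > 0 else float("inf")

# Rows = samples, columns = genes; gene 0 separates the classes well,
# gene 1 overlaps heavily between classes.
X = [[0.1, 3.0], [0.2, 2.8], [0.3, 3.1],
     [0.9, 2.9], [1.0, 3.2], [1.1, 3.0]]
y = [0, 0, 0, 1, 1, 1]
scores = [fisher_score([row[g] for row in X], y) for g in range(2)]
best_gene = max(range(2), key=lambda g: scores[g])  # gene 0 wins
```

In the full stepwise procedure this scoring is repeated at every iteration over the features not yet selected, with the model refit on the current subset plus each candidate.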
Random Forest can also be used to evaluate features. A Random Forest builds many decision trees on random subsets of the samples and features; the more often a feature (in our case, a gene) is chosen for splits in those trees, the better it "performs". By aggregating over many trees, we can estimate feature importance. This is an important point, because Random Forest can help us identify the features that contribute most to classification accuracy, for example those whose splits most reduce impurity.
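The idea of counting how often a gene is chosen for splits can be sketched with a toy "forest" of decision stumps (one-split trees) grown on bootstrap samples with random feature subsets. This is an illustrative simplification, not a full Random Forest, and the data are invented.

```python
# Frequency-based feature importance from a forest of decision stumps.
# A simplified stand-in for Random Forest importance; toy data.
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split_feature(X, y, feature_ids):
    """Among feature_ids, return the feature whose best threshold
    gives the lowest weighted Gini impurity."""
    best_feat, best_score = None, float("inf")
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best_feat, best_score = f, score
    return best_feat

def stump_forest_importance(X, y, n_trees=200, m_features=2, seed=0):
    """Count how often each feature wins the best split across
    bootstrapped stumps, each seeing a random feature subset."""
    rng = random.Random(seed)
    counts, n = Counter(), len(X)
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]          # bootstrap sample
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        subset = rng.sample(range(len(X[0])), m_features)   # feature subset
        counts[best_split_feature(Xb, yb, subset)] += 1
    return counts

# Toy expression matrix: gene 0 separates the classes, genes 1-2 are noise.
X = [[0.1, 5.0, 2.0], [0.2, 4.0, 3.0], [0.3, 5.5, 2.5],
     [0.9, 4.5, 2.2], [1.0, 5.2, 3.1], [1.1, 4.8, 2.8]]
y = [0, 0, 0, 1, 1, 1]
importance = stump_forest_importance(X, y)  # gene 0 is chosen most often
```

Real Random Forest implementations grow deep trees and typically report impurity-decrease or permutation importance rather than raw selection counts, but the intuition is the same: informative genes win splits far more often than noise.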
For step-by-step instructions on running this pipeline on the T-Bioinfo Server, and for the theory behind the algorithms used, visit Lesson 15: Transcriptomics on the OmicsLogic Learn Portal:
https://learn.omicslogic.com/Learn/course-5-transcriptomics/lesson/15-t3-supervised-machine-learning-feature-selection. In this lesson, you will learn about feature selection methods based on supervised machine learning classifiers.