Breast cancer can be subdivided into distinct subtypes, and Daemen et al. specified the subtype of each breast cancer cell line they analyzed. The 52 analyzed cell lines comprise 27 luminal, 14 basal, 6 claudin-low, and 5 normal-like cases (CellLines_ClinSubtypes.txt). Let us consider the following questions:
- Can we identify the subtype for a given sample ourselves, based on its available features (e.g., gene expression)?
- If so, can we also find the features (genes) that are most informative for this identification?
Answers to these questions can be provided by supervised machine learning methods, specifically, classification algorithms.
In this course we will test one of them, namely Linear Discriminant Analysis (LDA). LDA is similar to Principal Component Analysis (PCA), but takes class (subtype) information into account when determining components. Like other classification methods, LDA first takes a so-called training set as input. For the training set, the classes (subtypes) are known and specified. Based on this training set, LDA constructs a classifier that is then applied to new samples and assigns each of them to one of the classes.
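To make the train-then-classify idea concrete, here is a minimal sketch of LDA in plain Python. It is an illustration only, not the implementation used by the T-BioInfo Platform. The function names (solve, fit_lda, predict_lda) and the toy expression values for two hypothetical genes are invented for this example; the class labels 1 and 3 follow the Basal and Luminal codes used in this section. The sketch assumes a covariance matrix shared by all classes and uses class frequencies in the training set as priors.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_lda(samples, labels):
    """Return {class: (weights, bias)} defining one linear discriminant per class."""
    classes = sorted(set(labels))
    n, d = len(samples), len(samples[0])
    means, counts = {}, {}
    for k in classes:
        rows = [x for x, y in zip(samples, labels) if y == k]
        counts[k] = len(rows)
        means[k] = [sum(col) / len(rows) for col in zip(*rows)]
    # Pooled within-class covariance, shared by all classes.
    cov = [[0.0] * d for _ in range(d)]
    for x, y in zip(samples, labels):
        diff = [xi - mi for xi, mi in zip(x, means[y])]
        for i in range(d):
            for j in range(d):
                cov[i][j] += diff[i] * diff[j] / (n - len(classes))
    model = {}
    for k in classes:
        w = solve(cov, means[k])  # inverse covariance times class mean
        bias = -0.5 * sum(wi * mi for wi, mi in zip(w, means[k]))
        bias += math.log(counts[k] / n)  # log prior from class frequency
        model[k] = (w, bias)
    return model

def predict_lda(model, x):
    """Assign x to the class with the highest discriminant score."""
    def score(k):
        w, bias = model[k]
        return sum(wi * xi for wi, xi in zip(w, x)) + bias
    return max(model, key=score)

# Toy training set (invented values): rows are samples, columns are two genes.
train_x = [[1.0, 5.0], [1.5, 4.5], [0.8, 5.5], [1.2, 4.8],
           [5.0, 1.0], [4.5, 1.6], [5.5, 0.9], [4.8, 1.2]]
train_y = [1, 1, 1, 1, 3, 3, 3, 3]  # 1 = Basal, 3 = Luminal

model = fit_lda(train_x, train_y)
for new_sample in ([1.1, 5.1], [5.2, 1.1]):
    print(new_sample, "->", predict_lda(model, new_sample))
```

With real expression data the number of genes far exceeds the number of samples, so the pooled covariance matrix becomes singular and this naive version would fail; practical implementations reduce the number of features or regularize the covariance first.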
Let us check what LDA provides in our case for the gene expression data. In order to do this we combine gene expression data and information about subtypes into one table and use it as a “training set” input. Note that in this table each sample (cell line) is associated with a row, not a column, and sample names are replaced by class labels. We use the following association:
1 – Basal
2 – Claudin-low
3 – Luminal
4 – Normal-like
Also note that for the T-BioInfo Platform, the file name for the training set should end with “_train.txt”. Transposing the data table (turning columns into rows and vice versa) can be done in Excel by selecting, copying, and using Paste Special (on Mac, choose the “Transpose” option). There are many other options as well (e.g., the t() function in R).
The training set that we will use for this example is linked here: CellLines_LDA_expr_train.txt.
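As an alternative to Excel or R, the transposition and relabeling can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is taken to be a tab-delimited table with genes in rows and samples in columns, the sample names S1 to S3, the gene names, and the values are invented, and the “class” header cell is a guess at the layout rather than a documented requirement of the T-BioInfo Platform.

```python
import csv
import io

# Numeric class labels, following the association given above.
SUBTYPE_TO_LABEL = {"Basal": 1, "Claudin-low": 2, "Luminal": 3, "Normal-like": 4}

def build_training_rows(expression_text, subtype_by_sample):
    """Transpose a genes-by-samples table and replace sample names with class labels."""
    rows = list(csv.reader(io.StringIO(expression_text), delimiter="\t"))
    header, body = rows[0], rows[1:]
    samples = header[1:]                  # first header cell is the gene column name
    genes = [r[0] for r in body]
    # zip(*...) turns gene rows into per-sample columns, i.e. transposes the values.
    per_sample = list(zip(*[r[1:] for r in body]))
    out = [["class"] + genes]             # header row: an assumed layout
    for sample, values in zip(samples, per_sample):
        label = SUBTYPE_TO_LABEL[subtype_by_sample[sample]]
        out.append([str(label)] + list(values))
    return out

# Hypothetical input: genes in rows, samples in columns.
expression_text = (
    "gene\tS1\tS2\tS3\n"
    "GeneA\t2.0\t7.5\t6.9\n"
    "GeneB\t8.1\t1.2\t3.3\n"
)
subtypes = {"S1": "Basal", "S2": "Luminal", "S3": "Claudin-low"}

training_rows = build_training_rows(expression_text, subtypes)
training_text = "\n".join("\t".join(row) for row in training_rows)
print(training_text)  # save this under a name ending in "_train.txt"
```

Looking up each sample in the subtype dictionary raises a KeyError for any sample without a known subtype, which is a convenient check that the expression table and the subtype file agree.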
Additionally, we need to provide a similar table, but without class labels, as a test set. LDA will assign a class label to each sample in this table. The file name for this table should end with “_test.txt”. Here is the file that we will use: CellLines_LDA_expr_test.txt.
In research, clustering and classification methods are generally used together, one after another. For example, in the case of subtype identification, initial studies reveal features that are potentially important for a pathology. Then, based on these features, clustering methods identify more or less distinct groups representing subtypes. Finally, classification methods build classifiers that allow the subtype of new samples to be identified. Furthermore, knowing the subtypes, we can refine the list of “important features” and, based on this list, refine the division into subtypes itself.
To learn more about supervised machine learning analysis and get hands-on experience with the data using different algorithms, visit: https://learn.omicslogic.com/Learn/project-05-modeling-cancer-precision-medicine/lesson/03-supervised-machine-learning-analysis