Statistical analysis: T-Test in Excel to find the differences between two groups

Computers have changed the way we work with data to the point that they perform calculations we often do not completely understand. For example, the p-value, which is often used to determine statistical significance, is misunderstood by many of those who use it regularly. We will rely on both the p-value and statistical significance in our analysis of gene expression data, so let’s refresh our memory about these terms before we move on. This video covers the statistical analysis of gene and isoform expression. You will learn how to perform a T-Test in Excel and identify differentially expressed genes.
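For readers who want to see what a two-group T-Test computes, here is a minimal Python sketch. It assumes the unequal-variance (Welch) form of the test, which corresponds to type 3 in Excel's T.TEST function, and it obtains the two-sided p-value by numerically integrating the t-distribution density. The function names and the expression values are made up for illustration and do not come from the video.

```python
import math
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (t statistic, degrees of freedom) for Welch's two-sample t-test."""
    se_a = variance(group_a) / len(group_a)  # squared standard error of mean A
    se_b = variance(group_b) / len(group_b)  # squared standard error of mean B
    t = (mean(group_a) - mean(group_b)) / math.sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(group_a) - 1) + se_b ** 2 / (len(group_b) - 1)
    )
    return t, df


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus the density mass inside [-|t|, |t|]."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half_area = total * h / 3  # integral of the density from 0 to |t|
    return max(0.0, 1.0 - 2.0 * half_area)


# Hypothetical expression values of one gene in two groups of samples.
control = [5.1, 4.9, 5.3, 5.0]
treated = [6.2, 6.0, 6.5, 6.1]

t_stat, dof = welch_t(control, treated)
p_value = two_sided_p(t_stat, dof)
print(f"t = {t_stat:.3f}, df = {dof:.2f}, p = {p_value:.5f}")
```

A small p-value here says that a difference in means this large would be unlikely if the two groups had the same true mean. It does not say how large or how biologically important the difference is.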

Isoform Construction 

Alternative splicing (AS) is a post-transcriptional regulation mechanism that allows a single gene to produce multiple mRNA transcripts (isoforms). The roles of AS include regulating gene expression in response to environmental stimuli and developmental changes. Besides contributing to protein diversity and regulation, AS can produce variants that are nonfunctional and quickly degraded, which gives cells another mechanism to regulate gene expression after transcription but before translation. AS is a normal phenomenon in eukaryotes and is more abundant in higher eukaryotes than in lower eukaryotes. More than 95% of human genes and 60% of Drosophila multi-exon genes are alternatively spliced. In plants, 61% of intron-containing genes undergo alternative splicing.

The next step is to detect isoforms as paths across the links provided by reads. We have two options: look for exon junctions, or restore isoforms via paired-read links. Once the isoforms are constructed, we can proceed to measure their expression.
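The idea of isoforms as paths can be sketched with a toy splice graph in Python. This is a simplified illustration, not how any particular assembler works: exons are nodes, each observed junction is a directed edge, and every path from a first exon to a last exon is a candidate isoform. The exon names and the function name are hypothetical, and the sketch assumes junctions only point forward along the gene, so the graph has no cycles.

```python
def enumerate_isoforms(junctions, first_exons, last_exons):
    """List every exon path that starts at a first exon and ends at a last exon."""
    graph = {}
    for src, dst in junctions:
        graph.setdefault(src, []).append(dst)

    paths = []

    def walk(exon, path):
        path = path + [exon]
        if exon in last_exons:
            paths.append(path)
        for nxt in graph.get(exon, []):
            walk(nxt, path)

    for start in first_exons:
        walk(start, [])
    return paths


# Hypothetical gene with three exons; reads support junctions
# E1-E2, E2-E3 and the exon-skipping junction E1-E3.
junctions = [("E1", "E2"), ("E2", "E3"), ("E1", "E3")]
isoforms = enumerate_isoforms(junctions, first_exons={"E1"}, last_exons={"E3"})
for path in isoforms:
    print(" -> ".join(path))
```

Real genes can yield far more paths than true transcripts, which is why assembly tools add criteria for choosing among the candidates.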

More info on isoform construction.

IsoLasso

IsoLasso: A LASSO Regression Approach to transcriptome assembly based on RNA-Seq data, which aims at reconstructing all full-length mRNA transcripts simultaneously from millions of short reads.

IsoLasso is an RNA-Seq based transcriptome assembly tool built on the well-known LASSO algorithm, a multivariate regression method designed to seek a balance between the maximization of prediction accuracy and the minimization of interpretation. By including some additional constraints in the quadratic program involved in LASSO, IsoLasso is able to make the set of assembled transcripts as complete as possible. The algorithm addresses three objectives in transcriptome assembly: the maximization of prediction accuracy, the minimization of interpretation, and the maximization of completeness. The first objective, the maximization of prediction accuracy, requires that the expression levels estimated from the assembled transcripts be as close as possible to the observed ones for every expressed region of the genome. The second, the minimization of interpretation, follows the parsimony principle and seeks as few transcripts in the prediction as possible. The third, the maximization of completeness, requires that the maximum number of mapped reads (or "expressed segments" in gene models) be explained by (i.e., contained in) the predicted transcripts in the solution.
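The trade-off between the first two objectives can be illustrated with a small Python sketch of a plain non-negative LASSO fit. This is not IsoLasso's actual formulation: it leaves out the completeness constraints, and the coordinate-descent solver, the penalty value, the candidate isoforms, and the coverage numbers are all assumptions chosen for illustration. Each candidate isoform is a 0/1 vector over three segments, and the fit looks for non-negative expression levels that reproduce the observed coverage while keeping the total expression small.

```python
def nonnegative_lasso(columns, observed, penalty, sweeps=200):
    """Coordinate descent for
    0.5 * ||observed - sum_j x[j] * columns[j]||^2 + penalty * sum(x), x >= 0."""
    x = [0.0] * len(columns)
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            # Residual with the current contribution of isoform j left out.
            residual = [
                y - sum(x[k] * columns[k][i] for k in range(len(columns)) if k != j)
                for i, y in enumerate(observed)
            ]
            fit = sum(c * r for c, r in zip(col, residual))
            norm = sum(c * c for c in col)
            # Shrink by the penalty and clip at zero (no negative expression).
            x[j] = max(0.0, (fit - penalty) / norm)
    return x


# Hypothetical candidates over segments (S1, S2, S3):
candidates = [
    [1, 1, 1],  # full-length isoform
    [1, 0, 1],  # isoform that skips S2
    [0, 1, 0],  # S2 on its own
]
coverage = [10.0, 4.0, 10.0]  # made-up observed coverage per segment

levels = nonnegative_lasso(candidates, coverage, penalty=0.1)
print([round(v, 3) for v in levels])
```

In this toy case the coverage can be explained either by the first two candidates or by the last two. The penalty favours the first explanation, so the third candidate receives an expression level of zero.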

Read more about IsoLasso

Cufflinks

Cufflinks uses the genome alignment map and the junction reads identified by TopHat2 (or another junction-sensitive aligner) to assemble transcripts.

More info on Cufflinks

CuffMerge

CuffMerge generates a new GTF file that combines the original GTF annotation with the newly assembled transcripts.

More info on CuffMerge

To learn more, refer to the “Statistical Tests” lesson on the OmicsLogic portal: https://learn.omicslogic.com/Learn/course-5-transcriptomics/lesson/08-t2-statistical-tests

That lesson covers various adjustments to the way the p-value is calculated in order to limit false positives. It also introduces statistical tests that can be applied to gene expression data to assess the biological significance of the results.
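As one example of such an adjustment, the Python sketch below applies the Benjamini-Hochberg procedure, a commonly used correction when many genes are tested at once. Whether the linked lesson uses this particular method is an assumption; the function name and the p-values are invented for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values for four genes.
raw = [0.01, 0.04, 0.03, 0.5]
adj = benjamini_hochberg(raw)
for p, q in zip(raw, adj):
    print(f"raw p = {p:.3f}  adjusted p = {q:.4f}")
```

Genes whose adjusted p-value falls below the chosen threshold are then reported as differentially expressed, and the expected share of false positives among them is kept near that threshold.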