RNA-Seq (RNA sequencing) is a technique that uses next-generation sequencing (NGS) to reveal the presence and quantity of RNA in a biological sample at a given moment, capturing the continuously changing cellular transcriptome. It makes it possible to examine isoforms, post-transcriptional modifications, gene fusions, mutations/SNPs, and changes in gene expression over time or between different groups or treatments.
An RNA-Seq study includes the following steps:
Step 1 : Pre-processing the raw reads from input files
Step 2 : Aligning (mapping) the reads to a reference genome
Step 3 : Calculating the abundance of reads aligned to the reference genome
Step 4 : Statistical and Biological Interpretation of the Expression Table
Below is an overview of the different types of algorithms covered in the Transcriptomics 1 course for analysis of RNA-Seq data. The descriptions include color-coded hyperlinks to ancillary materials that you may choose to look through to deepen your understanding of the various methodologies.
Pre-Processing
It is advisable to prepare raw sequencing reads prior to basic bioinformatics analysis. Pre-processing of sequencing reads increases the likelihood of correct alignment and reduces expression biases due to technical artifacts. The Data Pre-Processing Section has several algorithms that can be used to clean raw data from technical additions made during library preparation. You can view a video animation of Next Generation Sequencing and learn about library preparation and the shotgun sequencing method here.
Start
RNA-Seq, also known as transcriptome sequencing, allows the user to take a snapshot of the RNA in a biological sample at a given time. Analysis of this kind of data can be used to study the ever-changing cellular transcriptome.
There are two basic types of RNA-seq, depending on the kit used for library preparation: PolyA RNA-seq (mRNA sequencing) and total RNA-seq (no polyA selection is used during library preparation; rRNAs are removed and everything else is sequenced, so this kind of library consists of mRNAs and long non-coding RNAs).
The RNA-Seq analysis pipeline starts with the “Start” job, which compiles the user-selected data input options into a series of tags and generates the correct pipeline options, reducing the number of possible algorithms to the ones that can handle the input data. A typical RNA-seq analysis includes the following basic steps:
- Pre-processing: cleaning the data from technical artifacts
- Mapping: alignment on a reference genome/transcriptome
- Quantification of expression values for each gene/transcript in each sample, generating a table of gene or isoform expression.
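The quantification step can be pictured as counting, for every sample, how many aligned reads were assigned to each gene. The following is a minimal Python sketch of that idea, not the platform's implementation; the sample names, gene names, and the `build_expression_table` helper are hypothetical and used only for illustration.

```python
from collections import Counter

def build_expression_table(assignments):
    """Build a gene-by-sample table of raw read counts.

    assignments: dict mapping sample name -> list of gene names,
                 one entry per aligned read (hypothetical input format).
    Returns a dict: gene -> {sample -> read count}.
    """
    counts = {sample: Counter(genes) for sample, genes in assignments.items()}
    all_genes = sorted({gene for c in counts.values() for gene in c})
    # Genes with no reads in a sample get an explicit count of 0.
    return {
        gene: {sample: counts[sample].get(gene, 0) for sample in assignments}
        for gene in all_genes
    }

# Toy data: each list element is the gene a read was assigned to.
assignments = {
    "sample_1": ["geneA", "geneA", "geneB"],
    "sample_2": ["geneB", "geneB", "geneC", "geneA"],
}

table = build_expression_table(assignments)
for gene, row in table.items():
    print(gene, row)
```

Real quantification tools also deal with reads that map to several genes or isoforms and usually normalize the counts; this sketch only shows the shape of the resulting table.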
The T-Bioinfo platform provides a user-friendly interface for analysis of RNA-seq data. The first step of analysis is the upload of raw sequencing data and the selection of a reference genome/transcriptome. The second step is the selection of algorithms. The platform guides the user in choosing appropriate algorithms by highlighting the options that are possible at each step of analysis. The choice of algorithm will affect the expression table results, so each algorithm comes with an explanation and a reference for more information.
PCRclean
PCRclean removes all duplicated reads from raw sequencing data. The presence of duplicated reads from polymerase chain reaction (PCR) amplification can distort estimates of gene expression levels. The module accepts raw sequencing reads in FASTQ or FASTA format. After PCR duplicates are removed, the output is given in the same format as the input (FASTQ or FASTA). More info on PCR amplification
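As an illustration of duplicate removal, the sketch below keeps the first read for every distinct sequence in a FASTQ stream and writes the result in the same format. It assumes that a "duplicate" means an identical read sequence; the exact criterion used by PCRclean is not described here, and the helper names `read_fastq` and `remove_exact_duplicates` are hypothetical.

```python
import io

def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual

def remove_exact_duplicates(records):
    """Keep only the first record seen for each distinct read sequence."""
    seen = set()
    unique = []
    for header, seq, qual in records:
        if seq not in seen:
            seen.add(seq)
            unique.append((header, seq, qual))
    return unique

# Toy input: read1 and read2 carry the same sequence.
fastq_text = (
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nACGTACGT\n+\nIIIIIIII\n"
    "@read3\nTTGGCCAA\n+\nIIIIIIII\n"
)

deduplicated = remove_exact_duplicates(read_fastq(io.StringIO(fastq_text)))
for header, seq, qual in deduplicated:
    # Output stays in FASTQ format, like the input.
    print(f"{header}\n{seq}\n+\n{qual}")
```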
Trimmomatic
The Trimmomatic algorithm trims technical sequences (from a database which stores sequences known to be used as adaptors in NGS experiments) from raw sequencing data. As a result, cleaned up reads are stored in a FASTQ file that can be used for mapping. More info on Trimmomatic.
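To make the idea of adapter trimming concrete, here is a simplified Python sketch that removes an adapter from the 3' end of a read using exact string matching. It is not Trimmomatic's actual algorithm, which tolerates mismatches and uses quality information; the `trim_adapter` function, the `min_overlap` parameter, and the example adapter sequence are assumptions made for illustration.

```python
def trim_adapter(read, adapter, min_overlap=3):
    """Remove an adapter from the 3' end of a read (exact matching only).

    Handles two cases:
      1. the full adapter occurs somewhere in the read;
      2. the read ends with a prefix of the adapter that is at least
         min_overlap bases long (adapter runs off the end of the read).
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Try the longest possible partial overlap first.
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]
    return read

# Example adapter sequence, used here purely for illustration.
ADAPTER = "AGATCGGAAGAGC"

full = trim_adapter("ACGTACGTAGATCGGAAGAGCTT", ADAPTER)   # full adapter inside read
partial = trim_adapter("ACGTACGTAGATC", ADAPTER)          # read ends inside adapter
clean = trim_adapter("ACGTACGT", ADAPTER)                 # no adapter present
print(full, partial, clean)
```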
Mapping
Short reads generated by RNA-seq experiments must ultimately be aligned, or "mapped", to a reference genome, which is usually used as a proxy for the transcriptome. The general objective when mapping or aligning a collection of sequencing reads to a reference is to discover the true location (origin) of each read with respect to that reference. Features of the reference such as repetitive regions, assembly errors, and missing information, together with sequencing errors, polymorphisms, and limited complexity in the reads, make reads harder to align. Thus, aligners must be flexible when applying mapping criteria: they must allow for approximate matches. Otherwise, a large proportion of sequencing reads will not be assigned to a particular feature in the reference. There is no easy answer to which aligner one should choose or which alignment parameter values one should select for a particular aligner. Like many other steps in RNA-seq analysis, there really is no substitute for a little exploration.
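The need for approximate matching can be shown with a toy example. The sketch below slides a read along a reference and reports every position where the number of mismatches stays within a threshold. It is a brute-force illustration only (real aligners use indexes and also allow gaps); the `map_read` function and the `max_mismatches` parameter are hypothetical names.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def map_read(read, reference, max_mismatches=2):
    """Return (position, mismatches) for every ungapped placement of the
    read on the reference with at most max_mismatches mismatches."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        mismatches = hamming(read, reference[pos:pos + len(read)])
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits

reference = "TTGACCGTAGGCTAACGTTAGC"
# Same as reference[5:15] except for one substituted base,
# as could result from a sequencing error or a polymorphism.
read = "CGTAGGATAA"

exact_hits = map_read(read, reference, max_mismatches=0)
approx_hits = map_read(read, reference, max_mismatches=1)
print("exact:", exact_hits)
print("one mismatch allowed:", approx_hits)
```

With exact matching the read is left unmapped, while allowing a single mismatch recovers its origin. This is the trade-off behind the alignment parameters mentioned above: looser criteria map more reads but can also admit wrong placements.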
Bowtie2-t
Bowtie2-t is a Bowtie2-based transcriptome alignment algorithm suitable for aligning sequencing reads to reference transcripts. Reference transcripts are generated from the GTF file and the reference genome file. Bowtie2 is an alignment algorithm based on the “seed” (or k-mer) approach. “Seed” substrings are extracted from the read and its reverse complement and aligned to the reference in an ungapped fashion. Once their positions on the reference are recorded, the seeds are extended into full alignments using SIMD-accelerated dynamic programming. More info on Bowtie
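The seed step described above can be sketched in a few lines of Python. This is a simplified illustration, not Bowtie2's implementation: a plain dictionary of k-mers stands in for Bowtie2's compressed index, only exact seed matches are found, and the extension step by dynamic programming is left out. The function names and the `k` and `interval` parameters are assumptions.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def build_kmer_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def extract_seeds(seq, k, interval):
    """Return (offset, seed) pairs taken every `interval` bases."""
    return [(off, seq[off:off + k]) for off in range(0, len(seq) - k + 1, interval)]

def seed_hits(read, index, k, interval):
    """Look up seeds from the read and its reverse complement.

    Returns (strand, seed_offset, reference_position) for each exact,
    ungapped seed match. Offsets on the '-' strand refer to the
    reverse-complemented read.
    """
    hits = []
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for off, seed in extract_seeds(seq, k, interval):
            for pos in index.get(seed, []):
                hits.append((strand, off, pos))
    return hits

reference = "GATTACAGGCTTAACCGGTTACGTAGC"
# A read sequenced from the reverse strand of reference[4:16].
read = reverse_complement(reference[4:16])

index = build_kmer_index(reference, k=4)
hits = seed_hits(read, index, k=4, interval=4)
for hit in hits:
    print(hit)
```

Seed hits that share the same strand and the same difference between reference position and seed offset point to one candidate location, which a full aligner would then extend into a complete alignment.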
TopHat2
TopHat2 is a junction-sensitive alignment algorithm. TopHat2 aligns reads to a reference genome using Bowtie2, a short-read aligner that uses a Burrows-Wheeler index. After the reads are aligned by Bowtie2, TopHat2 analyzes the mapping results to identify reads that align across splice junctions between exons, thus identifying the isoforms present in the dataset.
More info on TopHat2
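The idea of a read spanning a splice junction can be illustrated with a toy example: the read does not match the genome contiguously, but its two halves match exactly on either side of an intron. The sketch below is not TopHat2's algorithm; it uses exact string search, considers only the first occurrence of the left part, and the `find_junction` function, the `min_anchor` parameter, and the sequences are all made up for illustration.

```python
def find_junction(read, genome, min_anchor=4):
    """Try to place a read as two exact segments separated by a gap.

    Returns (gap_start, gap_end) as half-open genome coordinates of the
    skipped region (the putative intron), or None if the read matches
    contiguously or cannot be split this way.
    """
    if read in genome:
        return None  # contiguous match, no junction needed
    # Each side must keep at least min_anchor bases.
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_pos = genome.find(left)
        if left_pos == -1:
            continue
        left_end = left_pos + split
        # The right part must start downstream, leaving a gap of >= 1 base.
        right_pos = genome.find(right, left_end + 1)
        if right_pos != -1:
            return left_end, right_pos
    return None

# Toy genome: two exons separated by an intron.
exon1 = "ATGGCCATTA"
intron = "GTAAGTTTTTTCAG"
exon2 = "CTGAAGGTCA"
genome = "TT" + exon1 + intron + exon2 + "TT"

# A read from the spliced transcript, covering the exon1/exon2 boundary.
read = exon1[-6:] + exon2[:6]

junction = find_junction(read, genome)
print(junction, genome[junction[0]:junction[1]])
```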
To get a detailed understanding of what each step includes and the basis behind each algorithm being used, follow the lessons under Course 5: Transcriptomics on the OmicsLogic Portal.
For any queries, mail us at support@pine.bio