RNA seq on T-BioInfo

RNA-Seq is a technique for analyzing the transcriptome using data generated by next-generation sequencing (NGS) technologies; on T-BioInfo, microarray data can also be analyzed after being represented as NGS pseudo-reads. Successful transcriptome analysis depends largely on the bioinformatics tools that support each step of the process.

The RNA-seq section of T-BioInfo provides a flexible approach to transcriptome analysis, combining a number of known and new algorithms ("modules") with specially designed analysis features.

The analysis pipelines pass through twelve functional sections (analysis stages) shown on the interactive graph. Together they process your data from start to finish using section-specific algorithms (modules). From left to right, these sections are:

 

1. Data Pre-Processing: removal of primers from raw reads and format conversion; Result: cleaned NGS data or array data represented as NGS pseudo-reads.

2. Data Simulation: expression of gene isoforms is simulated; Result: artificial NGS data, with introduced errors, representing expression of pre-defined splice variants.

3. Error Correction: correction of sequencing errors; Result: about 75% of the sequencing errors are corrected.

4. Mapping on Genome or Genes: alignment of reads against reference genome or mRNAs; Result: alignments of reads against references

5. Exon Detection: detection of expected exons in the reference genome; Result: GTF file that annotates predicted exons in the genome.

Exon Detection

Exon detection identifies expected exons in the reference genome. The process outputs a GTF file that annotates the predicted exons. Putative new exons are detected using one of three approaches: (1) detection of acceptor/donor sites, (2) exons as regions of mapping enrichment, or (3) a combined approach.
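
The first approach can be illustrated with a minimal Python sketch. It assumes the canonical GT...AG rule (introns usually begin with GT at the donor site and end with AG at the acceptor site) and simply lists candidate introns on a forward-strand sequence. The function name `candidate_introns` and the length bounds are hypothetical choices for illustration; this is not the T-BioInfo implementation.

```python
def candidate_introns(seq, min_len=4, max_len=50):
    """List (start, end) pairs of candidate introns on the forward strand.

    Assumption: canonical GT...AG rule. `start` is the 0-based index of the
    donor "GT"; `end` is the exclusive index just past the acceptor "AG".
    """
    seq = seq.upper()
    # Positions of every donor (GT) and acceptor (AG) dinucleotide.
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]

    introns = []
    for d in donors:
        for a in acceptors:
            length = a + 2 - d  # intron length including both dinucleotides
            if min_len <= length <= max_len:
                introns.append((d, a + 2))
    return introns


example = "CCCGTAAAAGCCC"
print(candidate_introns(example))  # candidate intron spans GTAAAAG
```

Real splice-site detectors score the sequence context around each site rather than matching dinucleotides alone, so a scan like this greatly overpredicts on genomic sequence.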

AUGUSTUS

AUGUSTUS is an exon prediction algorithm based on a Generalized Hidden Markov Model (GHMM). The algorithm does not rely on alignment results; instead, it uses genomic markers (such as open reading frames) to determine exons on the reference. It can be used to generate a list of predicted exons, which can then be used to identify transcripts: first all reads are aligned with a junction-sensitive algorithm, and then the exons are assembled into transcripts with the Cufflinks algorithm.

JBrowse Visualization

JBrowse is an open-source genome browser designed for efficient visual exploration of large-scale sequencing data. It has been adapted to the online T-BioInfo environment and handles the output from Bowtie2-G (in BAM or SAM format).

Learn more: http://jbrowse.org

 

6. Mapping on exon junctions: how exons are linked in isoforms according to NGS data; Result: alignments of reads against exon junctions.

Mapping on Junctions

Next-generation sequencing technologies enable rapid and cheap genome-wide transcriptome analysis, providing vital information about gene structure, transcript expression, and alternative splicing. Key to this is the accurate identification of exon-exon junctions from RNA sequencing (RNA-seq) reads. A number of RNA-seq aligners capable of splitting reads across these splice junctions (SJs) have been developed; however, it has been shown that while they correctly identify most genuine SJs present in a given sample, they also often produce large numbers of incorrect SJs.

7. Isoform Construction: splice variants are generated based on found exon junctions; Result: GTF file that annotates the predicted splice variants 

8. GTF file processing: merging different annotations of the genome; Result: balanced annotation of the genome based on several NGS data sets.

9. Mapping Statistics: selection of the correct mapping for a read; Result: posterior probability for a read to be generated by a specific genome site

10. Expression Table: generation of expression values for genes and isoforms; Result: table of expression values across genes and isoforms

11. Differential Expression: differential expression according to predefined contrasts between biological conditions; Result: up and down regulation of genes

12. Mining analysis results: machine learning methods and integration of results for several parallel analysis pipelines; Result: compression of results and comparison of parallel analyses.
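
As an illustration of stage 9 (Mapping Statistics), the posterior probability that a read originated from a given genome site can be sketched with Bayes' rule. The source does not state which model T-BioInfo uses, so the uniform prior, the per-base error rate, and the mismatch-only likelihood below are assumptions, and `mapping_posteriors` is a hypothetical name.

```python
def mapping_posteriors(mismatches, read_len, error_rate=0.01, priors=None):
    """Posterior probability of each candidate site for a single read.

    mismatches: number of mismatches of the read at each candidate site.
    Assumed likelihood (independent per-base errors):
        L_i = e**m_i * (1 - e)**(read_len - m_i)
    Assumed prior: uniform over candidate sites unless `priors` is given.
    """
    if priors is None:
        priors = [1.0 / len(mismatches)] * len(mismatches)

    # Unnormalized posterior weight = prior * likelihood.
    weights = [
        p * (error_rate ** m) * ((1 - error_rate) ** (read_len - m))
        for m, p in zip(mismatches, priors)
    ]
    total = sum(weights)
    return [w / total for w in weights]


# A 50-base read that maps to two sites: perfectly, and with 2 mismatches.
posteriors = mapping_posteriors([0, 2], read_len=50)
print(posteriors)
```

Under these assumptions the perfect match receives almost all of the probability mass, while two equally good candidate sites would each receive 0.5.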

When TopHat is not the right choice

TopHat is not used when mapping to a genome that lacks an annotation file or that is not well defined or annotated. An alternative in this situation is a BS-based pipeline.

The BS-based pipeline first maps reads to the genome using Bowtie2-G. Once mapping is complete, a segmentation-based algorithm (BS) identifies the regions of the genome that are enriched in reads (i.e., regions of high expression). Next, mergeBS merges BS regions that lie close to one another, and updateGTFbyBS updates the annotation file accordingly. Based on this annotation, exprT generates expression tables for the enriched regions. seqBS then extracts the sequences of these regions, which are finally annotated using BlastX. The expression tables for all enriched regions found by the BS algorithm are provided to the user.

The above pipeline includes:

  • Bowtie2-G: for mapping on a genome

  • BS: for identifying regions of the genome that are enriched in reads

  • mergeBS: for merging nearby BS regions into a single region

  • updateGTFbyBS: to update/generate an “annotation” file based on the BS results

  • exprT: to generate an expression table for enriched regions of the genome

  • seqBS: for extracting sequences of regions enriched in reads

The output is an expression table of all enriched regions found by the BS algorithm, together with the sequences and annotations of those regions.
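
The idea behind the BS and mergeBS steps can be sketched in Python. This is a simplified stand-in, not the T-BioInfo code: the segmentation algorithm is replaced here by a plain coverage threshold, and the function names, `min_cov`, and `max_gap` are hypothetical. Reads are assumed to be half-open `(start, end)` intervals that lie within the genome.

```python
def enriched_regions(reads, genome_len, min_cov=2):
    """Return half-open (start, end) regions where coverage >= min_cov."""
    # Difference array: +1 where a read starts, -1 where it ends.
    diff = [0] * (genome_len + 1)
    for start, end in reads:
        diff[start] += 1
        diff[end] -= 1

    regions, cov, region_start = [], 0, None
    for pos in range(genome_len):
        cov += diff[pos]  # running sum gives coverage at this position
        if cov >= min_cov and region_start is None:
            region_start = pos
        elif cov < min_cov and region_start is not None:
            regions.append((region_start, pos))
            region_start = None
    if region_start is not None:
        regions.append((region_start, genome_len))
    return regions


def merge_nearby(regions, max_gap=10):
    """Merge regions separated by at most max_gap bases."""
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


reads = [(0, 10), (5, 15), (18, 28), (20, 30), (60, 70), (62, 72)]
regions = enriched_regions(reads, genome_len=100, min_cov=2)
print(regions)                # three separate enriched regions
print(merge_nearby(regions))  # the first two regions are merged
```

In the pipeline described above, the merged regions would then be written to the annotation file and used to build the expression table.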

HiSat2

 

HiSat2 (hierarchical indexing for spliced alignment of transcripts) is an alignment algorithm that serves as an alternative to TopHat2. Like TopHat2, it aligns reads to the reference genome and builds a map of junction reads for assembling transcripts. HiSat2 uses an indexing scheme based on the Burrows-Wheeler transform and the Ferragina-Manzini (FM) index, employing two types of indexes for alignment: a whole-genome FM index to anchor each alignment and numerous local FM indexes for very rapid extension of these alignments.
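
To make the FM index idea concrete, here is a minimal Python sketch of a Burrows-Wheeler transform with backward search for counting exact matches of a pattern. It illustrates only the basic technique, not HiSat2's hierarchical scheme of global and local indexes. The function names are hypothetical, and the naive suffix sort is suitable only for short demonstration strings.

```python
def build_fm_index(text):
    """Build a toy FM index: first-column offsets and occurrence table."""
    text += "$"  # sentinel: assumed absent from text and smaller than A/C/G/T
    # Naive suffix array (fine for a demo, far too slow for a genome).
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps around for i == 0).
    bwt = "".join(text[i - 1] for i in sa)

    # first_col[c]: number of characters in text strictly smaller than c.
    alphabet = sorted(set(text))
    first_col, total = {}, 0
    for c in alphabet:
        first_col[c] = total
        total += text.count(c)

    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return first_col, occ, len(bwt)


def count_occurrences(pattern, index):
    """Count exact occurrences of pattern using backward search."""
    first_col, occ, n = index
    lo, hi = 0, n  # current suffix-array interval [lo, hi)
    for ch in reversed(pattern):
        if ch not in first_col:
            return 0
        lo = first_col[ch] + occ[ch][lo]
        hi = first_col[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("ACGTACGTAC")
print(count_occurrences("ACG", index))  # number of exact matches of ACG
```

Backward search processes the pattern from its last character to its first, so the cost of a query depends on the read length rather than on the genome length.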

For hands-on practice with each analytical step, follow the transcriptomics lessons on the OmicsLogic Portal.