Objective

The main objective of the project is to develop an intuitive, cloud-based tool that supports comprehensive ‘omics data analysis and visualization for ecological indicator species (fish, birds, mammals, and invertebrates), together with a technical guidance document to facilitate end-user uptake.

Motivation

The initial design of the project was to develop improved and streamlined assembly and annotation pipelines, leveraging current bioinformatics and computing resources, to support transcriptomics (RNA-seq) analysis of the 12 key ecological indicator species. The conventional RNA-seq workflow is very time-consuming and resource-intensive. It requires multiple software tools to perform quality checks on raw reads, read error correction, de novo transcriptome assembly, transcriptome quality assessment, transcriptome annotation, and downstream analysis such as identification of differentially expressed genes (DEGs) and pathway enrichment analysis. Although the downstream statistical analysis is relatively straightforward, raw data processing remains a key obstacle. In particular, de novo transcriptome assembly is a complex, time-consuming task that requires extensive computational resources. Another key step in the conventional RNA-seq workflow is transcriptome annotation. The established procedure is to perform a BLAST search using the assembled transcripts, as illustrated by the popular BLAST2GO pipeline.

After following the conventional approach for several species, we decided to develop a scalable, generic approach to enable efficient, large-scale RNA-seq data analysis in a largely species-independent manner. Several superfast DNA-to-protein aligners, including DIAMOND, MMseqs2, and Kaiju, have been developed to map DNA reads directly to microbial protein databases, thus skipping genome assembly and directly quantifying the functional capabilities of the sample's microbiome. Bacterial genomes are densely packed with protein-coding genes and free of introns. Eukaryotic RNA-seq reads largely share these characteristics, because they derive from mature mRNAs from which the introns have been spliced out. In theory, it should therefore be possible to quantify the expression of protein-coding genes directly from eukaryotic RNA-seq reads using similar approaches and algorithms, although we are not aware of any existing tools that do this.

Design

For transcriptomics studies in non-model organisms that focus on mRNAs or protein-coding genes, we propose a new processing and analysis strategy: translate RNA-seq reads directly into all possible short amino acid (aa) sequences, then compare these with protein references to identify their likely functional homologs. This concept is illustrated in the figure below.

Conceptual solution to bypass the computationally intensive steps of de novo assembly and annotation involved in RNA-seq analysis of non-model species
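
To make the translation step of this design more concrete, the following is a minimal Python sketch of translating a single read in all six reading frames (three on the forward strand, three on the reverse complement) using the standard genetic code. It is an illustration only, not the project's implementation: the function names (`reverse_complement`, `translate_frame`, `six_frame_translate`, `peptide_fragments`), the frame labels, and the minimum fragment length are all assumptions introduced here. The comparison against protein references, which in practice would be delegated to a dedicated aligner, is not shown.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
_BASES = "TCAG"
_AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
          "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def translate_frame(seq: str, offset: int) -> str:
    """Translate seq starting at offset (0, 1 or 2).

    Codons containing ambiguous bases become 'X'; stop codons become '*'.
    A trailing partial codon is ignored.
    """
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(offset, len(seq) - 2, 3)
    )


def six_frame_translate(read: str) -> dict:
    """Translate a read in all six frames, keyed '+1'..'+3' and '-1'..'-3'."""
    rc = reverse_complement(read)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate_frame(read, offset)
        frames[f"-{offset + 1}"] = translate_frame(rc, offset)
    return frames


def peptide_fragments(peptide: str, min_len: int = 10) -> list:
    """Split a translation at stop codons and keep fragments >= min_len aa.

    min_len is a hypothetical cutoff chosen for illustration.
    """
    return [frag for frag in peptide.split("*") if len(frag) >= min_len]


if __name__ == "__main__":
    # Toy read, far shorter than a real RNA-seq read.
    for frame, peptide in six_frame_translate("ATGGCCAAATAG").items():
        print(frame, peptide)
```

Because the strand and reading frame of a read are unknown in advance, all six translations are candidates. Splitting at stop codons and discarding very short fragments is one simple way to reduce the number of sequences passed to the protein comparison step.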