3  Experimental Design & Processing

Author

Adrien Osakwe

3.1 Experimental Design

The first step is to ensure that the proposed experimental design has the potential to answer the desired biological question. To achieve this, we need to make appropriate selections for

  • library type

  • sequencing depth

  • number of replicates

while also ensuring that the execution of the experiment does not introduce additional biases (for example, ensuring that case and control replicates are not processed in separate batches).

3.1.1 RNA-extraction protocol

The RNA-extraction protocol must remove highly abundant transcripts (usually ribosomal RNA, which represents roughly 90% of total RNA in the cell) that would otherwise obscure the underlying biological signal of interest (mRNA represents only 1-2% of transcripts). In eukaryotes, we can enrich our samples for mRNA via Poly(A) selection or rRNA depletion.

3.1.1.1 Poly(A) Selection

Poly(A) selection requires a relatively high proportion of mRNA to be present with minimal degradation, as measured by the RNA integrity number (RIN). Because it enriches for polyadenylated transcripts, we can expect a higher proportion of sequenced reads to overlap known exons. However, most samples cannot be obtained in sufficient quantity or quality (RIN) to produce satisfactory Poly(A) libraries. As a result, ribosomal depletion is often used instead.

3.1.1.2 Strand-Preserving Libraries (REVIEW)

In cases where we want to quantify antisense and/or overlapping transcripts, standard methods such as Illumina's random hexamer priming (1st Gen) are insufficient, as they do not preserve information about which strand was expressed. Many protocols overcome this; a notable example is the dUTP method, which incorporates dUTP during second-strand cDNA synthesis so that the marked strand can later be selectively degraded.

The size of the fragments generated by the library protocol is a crucial determinant of adequate sequencing and analysis. Note that longer reads facilitate accurate alignment to a reference transcriptome and improve isoform identification.

3.1.2 Single-End vs. Paired-end (REVIEW - add diagram)

  • Paired-end reads are preferred for de novo assembly and isoform analysis

  • Single-end reads are cheaper and sufficient for well-annotated organisms to study gene expression levels

  • Paired-end is more reliable when working with poorly annotated transcriptomes

  • Single-end sequencing reads each fragment once from one end; paired-end sequencing reads it from both ends, yielding higher accuracy for alignment and downstream tasks

3.1.3 Library size

Library size is the number of sequenced reads per sample. The greater the library size, the easier it is to generate accurate measurements that include rare transcripts. The appropriate depth depends on your experimental setup. If the expected expression changes are in highly expressed genes, around 5 million reads may be sufficient. However, identifying changes in genes with low expression may require up to 100 million reads. For single cells, quantification can be achieved with 1 million reads, and around 50,000 reads are sufficient for most highly expressed genes.

An optimal depth should account for the complexity of the studied transcriptome. There is evidence that the added resolution of deeper sequencing can be accompanied by the detection of unwanted noise and off-target transcripts, including ambient RNA (CITATION MISSING HERE). Whether additional depth still adds information can be assessed with saturation curves (ADD DIAGRAM HERE), as in the sketch below.
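
As an illustration, the base-R sketch below simulates a saturation curve: a library with a skewed gene-abundance distribution is subsampled at increasing depths, and the number of detected genes is recorded at each depth. All parameters are illustrative, not taken from a real experiment.

```r
# Saturation curve sketch: subsample one library at increasing depths and
# count how many genes are detected at each depth.
set.seed(1)

n_genes   <- 20000
abundance <- rlnorm(n_genes, meanlog = 1, sdlog = 2)  # skewed expression levels
probs     <- abundance / sum(abundance)

full_depth <- 5e6                                     # reads in the full library
reads      <- sample.int(n_genes, full_depth, replace = TRUE, prob = probs)

depths   <- c(1e5, 5e5, 1e6, 2e6, 5e6)
detected <- sapply(depths, function(d) {
  sub <- reads[sample.int(full_depth, d)]             # subsample d reads
  length(unique(sub))                                 # genes with >= 1 read
})

# The curve flattens as depth grows; once it saturates, extra depth mostly
# adds noise rather than newly detected transcripts.
plot(depths, detected, type = "b", log = "x",
     xlab = "Sequencing depth (reads)", ylab = "Genes detected")
```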

3.1.4 Number of replicates

Beyond our choice of library protocol and size, we need to determine how many replicates to include. This decision depends on the technical variability of the RNA-seq procedure, the biological variability of the system of interest, and the desired statistical power, all of which can be assessed through power analysis.

Sequencing runs should also be planned carefully, especially when samples must be sequenced in batches, to minimize technical variability. Notable good practices are randomized sample processing and ensuring that covariates such as sex, age, and case/control status are well mixed across batches (see the sketch below).
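
As a minimal sketch of such planning, the base-R code below randomly assigns hypothetical samples to two batches, stratified by case/control status, and then tabulates covariates per batch to confirm they are well mixed. Sample names, sizes and covariates are invented for illustration.

```r
# Sketch: stratified randomization of samples across two sequencing batches.
set.seed(42)

samples <- data.frame(
  id     = paste0("S", 1:12),
  status = rep(c("case", "control"), each = 6),
  sex    = rep(c("M", "F"), times = 6)
)

# Randomly split each status group evenly across batches 1 and 2
samples$batch <- ave(seq_len(nrow(samples)), samples$status,
                     FUN = function(i) sample(rep(1:2, length.out = length(i))))

table(samples$batch, samples$status)  # cases/controls balanced per batch
table(samples$batch, samples$sex)     # check other covariates too
```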

  • In practice, three biological replicates are seen as the bare minimum for inference

To run a power analysis, a smaller pilot experiment can be used to estimate within-group variance and gene expression levels. The exact power will depend on the software used for differential expression, and many of these packages provide power estimates to users. If the method controls the false discovery rate (FDR), the number of genes tested affects power: testing many lowly expressed genes inflates the multiple-testing burden, so it may be preferable to remove them from your dataset beforehand. Alternatively, sequencing depth can be increased. However, the power gained from additional depth saturates for any sample, at which point increasing the number of replicates is a better solution.

To this end, tools such as Scotty can be used to find the trade-off between the number of replicates and sequencing depth that optimizes power for a given experimental budget.

NOTE: ADD TABLE SHOWING POWER CALCULATIONS - consider an exercise
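
As a hedged starting point for such an exercise, the base-R sketch below estimates by simulation the power to detect a two-fold change in a single gene as the number of replicates grows. Counts follow a negative binomial; the mean, dispersion and fold change are illustrative assumptions, and a t-test on log counts stands in for a real differential expression test.

```r
# Simulation-based power sketch for one gene across replicate counts.
set.seed(1)

power_sim <- function(n_reps, base_mean = 100, fold_change = 2,
                      dispersion = 0.1, alpha = 0.05, n_sims = 2000) {
  hits <- replicate(n_sims, {
    ctrl <- rnbinom(n_reps, mu = base_mean,               size = 1 / dispersion)
    case <- rnbinom(n_reps, mu = base_mean * fold_change, size = 1 / dispersion)
    t.test(log2(case + 1), log2(ctrl + 1))$p.value < alpha
  })
  mean(hits)  # fraction of simulations where the change was detected
}

reps <- 3:8
sapply(setNames(reps, paste0(reps, " reps")), power_sim)
# Power climbs with replicates; three replicates is a bare minimum,
# not a guarantee of adequate power.
```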

3.2 Analysis

Here, we will look at the standard steps required to go from raw sequencing reads to differential expression.

3.2.1 Quality Control Checkpoints

3.2.1.1 Raw Reads

Quality control of raw reads involves the following metrics/indicators:

  • sequence quality

  • GC content

  • adapter presence

  • k-mer overrepresentation and duplicated reads (sequencing errors)

  • PCR artifacts

  • sample/RNA contamination

Satisfactory duplication, k-mer and GC content levels will vary between experiments and organisms; however, we expect samples within an experiment to have similar estimates for each. A rule of thumb is to discard samples that disagree with the rest of the experiment by more than 30%. These analyses can be performed with FastQC for Illumina-based reads or NGSQC for any platform.

In general, sequence quality decreases toward the 3' end of reads. If the quality drops too much, the 3' ends can be trimmed to increase mappability, using tools such as FASTX-Toolkit and Trimmomatic.
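
For intuition, the base-R sketch below computes the mean Phred quality per read position from a FASTQ file, which is the pattern FastQC's per-base quality plot summarizes. It assumes fixed-length reads and Phred+33 encoding; "sample.fastq" is a hypothetical file name.

```r
# Sketch: mean Phred quality per read position from a FASTQ file.
fastq <- readLines("sample.fastq")
qual_strings <- fastq[seq(4, length(fastq), by = 4)]  # every 4th line is quality

# Convert ASCII quality characters to Phred scores (ASCII code minus 33)
qual_matrix <- t(sapply(qual_strings, function(q) utf8ToInt(q) - 33L))

mean_per_cycle <- colMeans(qual_matrix)
plot(mean_per_cycle, type = "l",
     xlab = "Position in read (5' to 3')", ylab = "Mean Phred quality")
# Positions where mean quality falls below ~20 are candidates for 3' trimming.
```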

3.2.1.2 Read Alignment

Reads can be mapped to a genome or transcriptome depending on the research question and the quality of the available reference. A crucial quality metric is the percentage of mapped reads, which acts as an estimate of sequencing accuracy and sample contamination. We expect 70-90% of reads to map to the human genome, with a significant proportion of reads mapping equally well to a small number of regions (multi-mapping reads). When mapping to the transcriptome instead, we expect a slightly lower mapping percentage, since reads from unannotated transcripts are lost, and a larger fraction of multi-mapping reads.
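
As a simple illustration, the sketch below flags samples whose mapping rate falls outside the expected band; the per-sample counts are invented and would normally come from the aligner's log or samtools flagstat.

```r
# Sketch: flag samples with mapping rates outside the expected 70-90% band.
aln <- data.frame(
  sample = c("S1", "S2", "S3"),
  total  = c(30e6, 28e6, 31e6),
  mapped = c(26e6, 24e6, 15e6)
)
aln$pct_mapped <- 100 * aln$mapped / aln$total
aln$suspect    <- aln$pct_mapped < 70 | aln$pct_mapped > 90
aln  # S3 (~48% mapped) would warrant investigation for contamination
```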

We also examine the uniformity of read coverage over exons and the strand to which reads map. Reads accumulating at the 3' end of transcripts in a poly(A)-selected sample may indicate low-quality RNA. The GC content of mapped reads can also reveal PCR biases (REVIEW). Tools such as Picard, RSeQC and Qualimap can be used for alignment quality control.

3.2.1.3 Quantification

After transcript quantification, users should check for GC content and gene length biases so that normalization methods can correct for them. If the reference transcriptome is well annotated, researchers can use biotype composition to further assess sample quality (e.g., we would expect low rRNA levels in a poly(A)-selected sample). Useful diagnostic plots are provided by the NOISeq and EDASeq packages in R.

3.2.1.4 Reproducibility

Reproducibility between replicates is essential to ensure the data can provide useful insights. We expect technical replicates to show high reproducibility of gene expression (Spearman \(R^2 > 0.9\)). Reproducibility across biological replicates will depend on the expected heterogeneity, disease status and other covariates. A good check is to visualize the samples with PCA and see whether they form batch- or covariate-specific clusters, as in the sketch below.
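
A minimal sketch of both checks, using a simulated gene-by-sample count matrix with made-up batch labels:

```r
# Sketch: reproducibility checks on a simulated count matrix.
set.seed(3)
counts <- matrix(rnbinom(2000 * 6, mu = 50, size = 5), nrow = 2000,
                 dimnames = list(NULL, paste0("S", 1:6)))
meta <- data.frame(batch     = rep(c("A", "B"), each = 3),
                   condition = rep(c("ctrl", "case"), times = 3))

# Technical replicates (pretend S1 and S2 are one): expect Spearman R^2 > 0.9
r2 <- cor(counts[, "S1"], counts[, "S2"], method = "spearman")^2

# PCA on log counts: look for batch- or covariate-driven clusters
pca <- prcomp(t(log2(counts + 1)))
plot(pca$x[, 1], pca$x[, 2], pch = 19,
     col = as.integer(factor(meta$batch)),
     xlab = "PC1", ylab = "PC2")
legend("topright", legend = levels(factor(meta$batch)),
       col = 1:2, pch = 19, title = "Batch")
```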

3.2.1.5 Transcript Identification

In the presence of a reference genome, RNA-seq reads are mapped onto the genome or transcriptome to determine which transcripts they represent. Using a reference transcriptome prevents the discovery of novel transcripts and emphasizes quantification.

In the absence of a sequenced genome, reads must first be assembled into contigs, which represent a candidate transcriptome onto which the reads can then be mapped for quantification. In both cases, read coverage can be used to estimate expression levels. Transcript identification and quantification can be done jointly or sequentially.

3.2.1.6 Alignment

As mentioned previously, we can map to either a reference genome or a reference transcriptome. In both cases, some reads map to multiple regions due to repetitive sequences or protein domains shared across genes. Such multi-mapping reads account for a large proportion of all reads (even larger when using a transcriptome, due to isoforms) and should not be discarded. In isoform-level analyses, proper transcript identification and quantification require additional rigour to generate reproducible insights.

3.2.1.7 Transcript Discovery

Because short reads rarely span multiple splice junctions, full-length transcripts are difficult to infer from them accurately. Tools such as GRIT can incorporate CAGE or RAMPAGE data to improve annotation. In general, paired-end reads, greater sequencing depth, and more replicates can greatly reduce false positives in transcript discovery. Methods adapted to this task for short-read data include Cufflinks, iReckon, SLIDE, StringTie and Montebello.

If isoform or novel transcript discovery is desired, an easier solution is to use long-read sequencing data such as PacBio SMRT, which yields more accurate contigs and maps to reference transcriptomes with greater ease.

3.2.1.8 De novo transcript reconstruction

If no reference genome is available, the RNA reads must be assembled de novo. Many packages exist for this task, including SOAPdenovo-Trans, Oases, Trans-ABySS and Trinity. Paired-end strand-specific sequencing or long-read sequencing is usually preferred. Sequencing depth that is too low (lowly expressed genes are missed) or too high (greater contamination risk and longer runtimes) can both hamper accurate assembly. It is therefore recommended to reduce the number of reads in silico for deeply sequenced samples, as in the sketch below. In comparative analyses, it is important to pool the reads from all samples to generate a single set of contigs.
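
As a sketch of in silico read reduction, the base-R code below subsamples whole FASTQ records (4 lines per read) down to a target depth. The file names and target are hypothetical; dedicated tools such as seqtk are typically used in practice.

```r
# Sketch: downsample a deeply sequenced FASTQ before de novo assembly.
set.seed(7)

fq      <- readLines("deep_sample.fastq")
n_reads <- length(fq) / 4
target  <- 1e6                                    # illustrative target depth

keep  <- sort(sample.int(n_reads, min(target, n_reads)))
start <- 4 * (keep - 1) + 1                       # first line of each kept record
keep_lines <- as.vector(rbind(start, start + 1, start + 2, start + 3))
writeLines(fq[keep_lines], "deep_sample.downsampled.fastq")
```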

3.2.1.9 Transcript quantification

Transcript quantification is usually defined as counting the number of reads mapping to a given transcript sequence. Programs such as HTSeq-count or featureCounts provide simple frameworks for aggregating raw counts. Gene-level quantification (focusing on genes rather than individual isoforms) uses a gene transfer format (GTF) file containing the genomic coordinates of exons and genes, and may discard multi-mapping reads.
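
Assuming the Bioconductor package Rsubread is installed, a minimal gene-level counting sketch could look like the following; the BAM and GTF file names are hypothetical.

```r
# Gene-level counting sketch using Rsubread's featureCounts. Exons are
# grouped into per-gene counts; multi-mapping reads are excluded by default.
library(Rsubread)

fc <- featureCounts(
  files               = c("S1.bam", "S2.bam"),
  annot.ext           = "annotation.gtf",  # GTF with exon/gene coordinates
  isGTFAnnotationFile = TRUE,
  isPairedEnd         = TRUE
)
raw_counts <- fc$counts  # genes x samples matrix of raw read counts
head(raw_counts)
```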

In practice, raw read counts are not ideal for analysis, as they are biased by transcript length, total read number, and sequencing biases. To this end, normalized metrics are used to facilitate fair comparisons across samples (see the sketch after this list).

  • Reads per kilobase of exon model per million reads (RPKM)

    • RPKM is a within-sample normalization method which removes transcript-length and library size effects

    • Designed for single-end reads

  • Fragments per kilobase of exon model per million mapped reads (FPKM)

    • FPKM is an extension of RPKM designed for paired-end reads

    • Each read pair is treated as a single fragment (when both mates map). This avoids counting a fragment twice, which RPKM cannot do.

    • Renders the same output as RPKM on single-end reads.

  • Transcripts per million (TPM)

    • TPM is an extension of RPKM where we first normalize by transcript length and then normalize by sequencing depth.

    • The sum of all TPMs is the same for each sample, making it easier to compare the proportion of reads mapped to a gene across samples.

    • FPKM values can also be converted into TPM
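
To make the arithmetic concrete, the base-R sketch below computes RPKM/FPKM and TPM from a small hypothetical count matrix; the gene names, lengths, and counts are all illustrative.

```r
# Sketch: computing RPKM/FPKM and TPM from a raw count matrix
# (`raw_counts` is genes x samples).
set.seed(1)
raw_counts     <- matrix(rpois(8, 500), nrow = 4,
                         dimnames = list(paste0("g", 1:4), c("S1", "S2")))
gene_length_kb <- c(2.0, 0.5, 1.5, 3.0)   # transcript lengths in kilobases

# RPKM/FPKM: normalize by library size (per million) first, then by length
lib_size_millions <- colSums(raw_counts) / 1e6
rpkm <- sweep(raw_counts, 2, lib_size_millions, "/") / gene_length_kb

# TPM: normalize by length first, then by the per-sample total (per million)
rate <- raw_counts / gene_length_kb
tpm  <- sweep(rate, 2, colSums(rate) / 1e6, "/")

colSums(tpm)   # always 1e6 per sample, unlike colSums(rpkm)
```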

Note that the correction for transcript length is crucial for within-sample comparisons across genes, but not for comparing a gene across samples (the length bias affects all samples equally). Cufflinks, however, can identify cases where the effective transcript length differs significantly across samples, which must then be addressed.

Although TPM facilitates cross-sample comparisons, it retains biases that other normalization techniques can address, as in the sketch below.
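
One widely used alternative (chosen here as an example; it is not named in this section) is the median-of-ratios approach popularized by DESeq. The base-R sketch below computes per-sample size factors, reusing the hypothetical raw_counts matrix from the previous example.

```r
# Sketch of median-of-ratios size factors. Robust to a handful of highly
# expressed genes inflating the library size.
log_geo_means <- rowMeans(log(raw_counts))   # per-gene log geometric mean
ok <- is.finite(log_geo_means)               # drop genes with any zero counts

size_factors <- apply(raw_counts, 2, function(col) {
  exp(median(log(col[ok]) - log_geo_means[ok]))
})

normalized <- sweep(raw_counts, 2, size_factors, "/")
size_factors
```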

3.2.2