6  Single Cell Intro

Author

Adrien Osakwe

6.1 Single Cell Analyses

This section contains very basic examples of R code for Single cell analyses using 10X data.

Examples in this section were made using the OSCA e-book and data from the 10X Genomics Dataset Page. Pancreas datasets came from the Grun et al. and Muraro et al. papers.

6.2 1. Loading 10X Data into R

We can load 10X Data into R to generate a ‘SingleCellExperiment’ object. We can either

  • load the cellranger outputs directly

    set.seed(1234)
    library(DropletUtils)
    Loading required package: SingleCellExperiment
    Loading required package: SummarizedExperiment
    Loading required package: MatrixGenerics
    Loading required package: matrixStats
    Warning: package 'matrixStats' was built under R version 4.2.3
    
    Attaching package: 'MatrixGenerics'
    The following objects are masked from 'package:matrixStats':
    
        colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
        colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
        colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
        colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
        colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
        colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
        colWeightedMeans, colWeightedMedians, colWeightedSds,
        colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
        rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
        rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
        rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
        rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
        rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
        rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
        rowWeightedSds, rowWeightedVars
    Loading required package: GenomicRanges
    Loading required package: stats4
    Loading required package: BiocGenerics
    
    Attaching package: 'BiocGenerics'
    The following objects are masked from 'package:stats':
    
        IQR, mad, sd, var, xtabs
    The following objects are masked from 'package:base':
    
        anyDuplicated, aperm, append, as.data.frame, basename, cbind,
        colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
        get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
        match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
        Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort,
        table, tapply, union, unique, unsplit, which.max, which.min
    Loading required package: S4Vectors
    
    Attaching package: 'S4Vectors'
    The following objects are masked from 'package:base':
    
        expand.grid, I, unname
    Loading required package: IRanges
    
    Attaching package: 'IRanges'
    The following object is masked from 'package:grDevices':
    
        windows
    Loading required package: GenomeInfoDb
    Loading required package: Biobase
    Welcome to Bioconductor
    
        Vignettes contain introductory material; view with
        'browseVignettes()'. To cite Bioconductor, see
        'citation("Biobase")', and for packages 'citation("pkgname")'.
    
    Attaching package: 'Biobase'
    The following object is masked from 'package:MatrixGenerics':
    
        rowMedians
    The following objects are masked from 'package:matrixStats':
    
        anyMissing, rowMedians
    #Load raw cellranger counts
    #file.name <- "../Data/raw_gene_bc_matrices/GRCh38/"
    
    #You can also load the filtered counts (removes empty droplets)
    file.name <- "../Data/filtered_gene_bc_matrices/GRCh38/"
    
    sce <- read10xCounts(file.name)
    'as(<dgTMatrix>, "dgCMatrix")' is deprecated.
    Use 'as(., "CsparseMatrix")' instead.
    See help("Deprecated") and help("Matrix-deprecated").
    #read10xCounts takes a directory path and looks for the three key 10X outputs
    #barcodes.tsv, genes.tsv, matrix.mtx
    sce
    class: SingleCellExperiment 
    dim: 33694 4340 
    metadata(1): Samples
    assays(1): counts
    rownames(33694): ENSG00000243485 ENSG00000237613 ... ENSG00000277475
      ENSG00000268674
    rowData names(2): ID Symbol
    colnames: NULL
    colData names(2): Sample Barcode
    reducedDimNames(0):
    mainExpName: NULL
    altExpNames(0):
  • load a pre-existing .rds file that contains a SingleCellExperiment object.

    sce <- readRDS("../Data/sce.rds")
library(scater)
Loading required package: scuttle
Loading required package: ggplot2
Warning: package 'ggplot2' was built under R version 4.2.3
rownames(sce) <- uniquifyFeatureNames(
    rowData(sce)$ID, rowData(sce)$Symbol)


#Load Gene chromosome locations
location <- readRDS("../Data/gene_locations.rds")
#Generated with this code
# library(EnsDb.Hsapiens.v75)
# location <- mapIds(EnsDb.Hsapiens.v75, keys=rowData(sce)$ID,
#     column="SEQNAME", keytype="GENEID")
## The SingleCellExperiment Object
sce
class: SingleCellExperiment 
dim: 33694 4340 
metadata(1): Samples
assays(1): counts
rownames(33694): RP11-34P13.3 FAM138A ... AC213203.1 FAM231B
rowData names(2): ID Symbol
colnames: NULL
colData names(2): Sample Barcode
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
#Look at feature Data
rowData(sce)
DataFrame with 33694 rows and 2 columns
                          ID       Symbol
                 <character>  <character>
RP11-34P13.3 ENSG00000243485 RP11-34P13.3
FAM138A      ENSG00000237613      FAM138A
OR4F5        ENSG00000186092        OR4F5
RP11-34P13.7 ENSG00000238009 RP11-34P13.7
RP11-34P13.8 ENSG00000239945 RP11-34P13.8
...                      ...          ...
AC233755.2   ENSG00000277856   AC233755.2
AC233755.1   ENSG00000275063   AC233755.1
AC240274.1   ENSG00000271254   AC240274.1
AC213203.1   ENSG00000277475   AC213203.1
FAM231B      ENSG00000268674      FAM231B
#Look at metadata
colData(sce)
DataFrame with 4340 rows and 2 columns
                     Sample            Barcode
                <character>        <character>
1    ../Data/filtered_gen.. AAACCTGAGAAGGCCT-1
2    ../Data/filtered_gen.. AAACCTGAGACAGACC-1
3    ../Data/filtered_gen.. AAACCTGAGATAGTCA-1
4    ../Data/filtered_gen.. AAACCTGAGCGCCTCA-1
5    ../Data/filtered_gen.. AAACCTGAGGCATGGT-1
...                     ...                ...
4336 ../Data/filtered_gen.. TTTGGTTTCGCTAGCG-1
4337 ../Data/filtered_gen.. TTTGTCACACTTAACG-1
4338 ../Data/filtered_gen.. TTTGTCACAGGTCCAC-1
4339 ../Data/filtered_gen.. TTTGTCAGTTAAGACA-1
4340 ../Data/filtered_gen.. TTTGTCATCCCAAGAT-1
#Check Dimensions
nrow(sce)
[1] 33694
ncol(sce)
[1] 4340
dim(sce)
[1] 33694  4340

6.3 2. Basic Quality Control

Here, we loaded the raw, unfiltered 10X counts which require us to do some cleaning (particularly for empty droplets!).

In the quality control of single cell data, we usually consider the following metrics:

  • % of mitochondrial genes

  • # of non-zero features per cell

  • total read count per cell

We also want to filter out the following:

  • empty droplets

  • doublet cells (scDblFinder)

In practice, you may want to also consider some other features.

6.3.1 Generating QC Metrics

set.seed(1234)
#Use below IF using raw cellranger output
# e.out <- emptyDrops(sce)
# sce <- sce[,which(e.out$FDR <= 0.001)]
unfiltered <- sce

stats <- perCellQCMetrics(sce, subsets=list(Mito=which(location=="MT")))

high.mito <- isOutlier(stats$subsets_Mito_percent, type="higher")
sce <- sce[,!high.mito]

summary(high.mito)
   Mode   FALSE    TRUE 
logical    4175     165 

6.3.2 Visualizing

colData(unfiltered) <- cbind(colData(unfiltered), stats)
unfiltered$discard <- high.mito

gridExtra::grid.arrange(
    plotColData(unfiltered, y="sum", colour_by="discard") +
        scale_y_log10() + ggtitle("Total count"),
    plotColData(unfiltered, y="detected", colour_by="discard") +
        scale_y_log10() + ggtitle("Detected features"),
    plotColData(unfiltered, y="subsets_Mito_percent",
        colour_by="discard") + ggtitle("Mito percent"),
    ncol=2
)

6.3.3 Additional Filtering

Based on the results, we can use the other QC metrics for additional filteirng

stats
DataFrame with 4340 rows and 6 columns
           sum  detected subsets_Mito_sum subsets_Mito_detected
     <numeric> <integer>        <numeric>             <integer>
1         1738       748              111                    11
2         3240      1052              177                    12
3         1683       739              124                    11
4         2319       875               89                    11
5         2983       951               67                    11
...        ...       ...              ...                   ...
4336      6514      1809              304                    11
4337      3293      1250              140                    11
4338      8322      2226              180                    12
4339      2933       984               56                    11
4340      3322      1212              180                    11
     subsets_Mito_percent     total
                <numeric> <numeric>
1                 6.38665      1738
2                 5.46296      3240
3                 7.36780      1683
4                 3.83786      2319
5                 2.24606      2983
...                   ...       ...
4336              4.66687      6514
4337              4.25144      3293
4338              2.16294      8322
4339              1.90931      2933
4340              5.41842      3322
#Practice Add metadata column that says which cells have at least 1500 detected genes


#Practice Add metadata column that says which cells have at least 10 detected mitochondrial genes

6.3.4 Normalization

library(scran)
set.seed(1234)
clusters <- quickCluster(sce) #Simple way to generate clusters
sce <- computeSumFactors(sce, cluster=clusters)
sce <- logNormCounts(sce)


dec <- modelGeneVar(sce)
top_hvgs <- getTopHVGs(dec,n = 2000)

6.4 3. Dimension Reduction & Clustering

Dimension reduction is extremely useful for visualizing single cell data. Although PCA is still viable, it does not perform as well as non-linear methods such as t-SNE and UMAP. Do note that although UMAP is currently the tool of choice, there is currently a big debate on the benefits of using such methods in analysis (I suggest looking up Lior Pachter’s work to learn more about this).

6.4.1 t-SNE

set.seed(1234)
sce <- denoisePCA(sce, subset.row=top_hvgs, technical=dec)


sce <- runTSNE(sce, dimred="PCA")
plotTSNE(sce, colour_by="Sample")

6.4.2 UMAP

set.seed(1234)

sce <- runUMAP(sce, dimred="PCA")
plotUMAP(sce, colour_by="Sample")

6.4.3 K-means Clustering

set.seed(1234)
g <- buildSNNGraph(sce, k=10, use.dimred = 'PCA')
clust <- igraph::cluster_walktrap(g)$membership
colLabels(sce) <- factor(clust)

plotTSNE(sce, colour_by="label")

plotUMAP(sce, colour_by="label")

6.5 4. Cell Type Annotation

With our generated UMAP coordinates and clusters, we can undertake the task of cell annotation. It is possible to do this manually and automatically. In practice, I suggest to combine both approaches to ensure your annotations are as accurate as possible. In this section we will use pancreas samples.

6.5.1 Automated Annotation

A key advantage of automated annotation is that it usually (depending on the method) provides an annotation for individual cells. This is quite useful for scoring individual cells and identifying sub cell types.

#Single R
library(SingleR)
sceM <- readRDS("../Data/ref_pancreas.rds")
sceG <- readRDS("../Data/pancreas_sce.rds")
sceM <- logNormCounts(sceM)
sceG <- logNormCounts(sceG) 

pred.grun <- SingleR(test=sceG, ref=sceM, labels=sceM$label, de.method="wilcox")
Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
or useNames = TRUE.
table(pred.grun$labels)

     acinar       alpha        beta       delta        duct endothelial 
        657         245         276          57         367          34 
    epsilon mesenchymal          pp     unclear 
          1          41          35           5 
sceG$pred <- pred.grun$labels


decG <- modelGeneVar(sceG)
top_hvgsG <- getTopHVGs(decG,n = 2000)
sceG <- denoisePCA(sceG, subset.row=top_hvgsG, technical=decG)
sceG <- runUMAP(sceG, dimred="PCA")
plotUMAP(sceG, colour_by="pred")

6.5.2 Manual Annotation

An issue with labeling individual cells is that their annotation is more susceptible to noise. To this end, manual annotations which are usually done at the level of clusters may be more reliable.

#Generate Clusters
set.seed(1234)
g <- buildSNNGraph(sceG, k=7, use.dimred = 'PCA')
Warning in (function (to_check, X, clust_centers, clust_info, dtype, nn, :
detected tied distances to neighbors, see ?'BiocNeighbors-ties'
clust <- igraph::cluster_walktrap(g)$membership
colLabels(sceG) <- factor(clust)
plotUMAP(sceG, colour_by="label")

rownames(sceG) <- rowData(sceG)$symbol
plotExpression(sceG, features=c("INS-IGF2", "GCG",
    "SST", "PRSS1"), x="label", colour_by="label")

plotUMAP(sceG, colour_by="INS")

6.5.3 Considerations for DGE during cluster annotation

Previously, many researchers would simply run a DGE analysis on different clusters to determine cell types. However, this has been proven to be a case of ‘double-dipping’ data which is a serious concern for reproducibility. Although this approach can work fine for very distinct cell types, it is not recommended for the analysis of more similar cell types of cell subtypes.

markers <- findMarkers(sceG, pval.type="some", direction="up")
marker.set <- markers[["4"]]
as.data.frame(marker.set[1:30,1:3])
               p.value          FDR summary.logFC
INS-IGF2  1.955456e-86 3.923426e-82      7.000224
SCG5      1.587739e-73 1.592820e-69      2.823960
BEX1      1.065568e-67 7.126519e-64      1.922672
SCGN      3.693829e-66 1.852825e-62      2.842336
NPTX2     3.397404e-65 1.363310e-61      2.812254
ABCC8     4.404209e-61 1.472768e-57      2.553605
HADH      2.092684e-59 5.998231e-56      1.783635
SYT13     7.704800e-57 1.932364e-53      1.653311
PTPRN     6.749016e-53 1.504581e-49      1.659602
MAFA      3.483224e-51 6.988741e-48      2.145823
UCHL1     1.490403e-49 2.718495e-46      2.234817
INSM1     2.587066e-49 4.325574e-46      1.825767
INS       3.434256e-49 5.300379e-46      7.463628
PAX6      1.068552e-48 1.531387e-45      1.737993
SCG2      1.309912e-47 1.752138e-44      1.865331
SCD       1.861938e-46 2.334870e-43      2.411018
ERO1LB    8.187010e-46 9.662598e-43      2.173745
PCSK1     9.052241e-46 1.009023e-42      1.747699
G6PC2     2.134745e-45 2.254290e-42      2.127562
SCG3      8.925831e-43 8.954393e-40      1.420783
RAP1GAP2  2.949814e-41 2.818336e-38      1.205589
NEUROD1   6.000074e-41 5.472068e-38      1.529432
CHGA      1.064801e-40 9.288771e-38      1.769868
LOC728392 1.871375e-40 1.564470e-37      1.277637
CPE       3.019806e-40 2.423575e-37      3.340525
PCSK1N    4.668968e-40 3.603007e-37      1.349134
SREBF1    1.717845e-39 1.276550e-36      1.325062
SLC30A8   1.087608e-38 7.793488e-36      1.919711
ADCYAP1   2.869340e-38 1.985187e-35      1.182864
SORL1     1.352898e-37 9.048182e-35      1.187146