Andrew Adams, BMI 5730, Lab 5: scRNA-seq

Single cell RNAseq Homework

library(Seurat)
library(SeuratData)
library(patchwork)
library(ifnb.SeuratData)
library(tidyverse)
library(presto)

immune.combined<-readRDS("integration_tutorial.rds")
Idents(immune.combined)<-immune.combined$seurat_annotations
# Do NOT integrate the object again! 
# we set the default assay to RNA for differential expression
DefaultAssay(immune.combined) <- "RNA"
# We scale all the genes in the dataset so we don't get an error when plotting heatmap in question #4
# First time, we only scaled the top 2000 most variable genes
# This time we scale them all
immune.combined <- ScaleData(immune.combined, features = rownames(immune.combined))
set.seed(1234)

1. Plot UMAP

Plot dimplot of UMAP clusters. Then plot dimplot of UMAP clusters split by sample.

#Calculate UMAP clusters for integrated data
immune.combined <- FindNeighbors(immune.combined, reduction = "integrated.cca", dims = 1:30)
immune.combined <- FindClusters(immune.combined, resolution = 1, verbose=FALSE)
immune.combined <- RunUMAP(immune.combined,dims = 1:30,reduction = 'integrated.cca')

# Plot dimplots of UMAP clusters
DimPlot(immune.combined, reduction = 'umap', group.by=c('stim','seurat_annotations'))

DimPlot(immune.combined, reduction = "umap", split.by = "stim", group.by="seurat_annotations")

2. Number of cells

Make a table of the number of cells in each sample. Make another table of the number of cells in each annotated cluster (seurat_annotations) split by sample. Comment on what you can observe from this data regarding the proportion of cell types per sample. Plot it in excel (easiest!) or R to help you think about the proportions.

# Number of cells in each sample
table(immune.combined$stim)

## 
## CTRL STIM 
## 6548 7451

# Number of cells in each cell type, per sample
cells_per_sample<-table(immune.combined$seurat_annotations, immune.combined$stim)
cells_per_sample

##               
##                CTRL STIM
##   CD14 Mono    2215 2147
##   CD4 Naive T   978 1526
##   CD4 Memory T  859  903
##   CD16 Mono     507  537
##   B             407  571
##   CD8 T         352  462
##   T activated   300  333
##   NK            298  321
##   DC            258  214
##   B Activated   185  203
##   Mk            115  121
##   pDC            51   81
##   Eryth          23   32

# Plot to examine proportions
plot(cells_per_sample)

From examining the numbers in our table, we can see that for many cell types, each cell type has a similar proportion within each sample; that is, there are about the same number of cells in both the STIM and CTRL groups, although on average the STIM samples have more. However, there is some variability, with the largest differences (e.g., “CD4 Naive T”) having up to ~1.6 times more cells in the STIM group.

Plotting our table of cell type counts per sample helps us see this. In the mosaic plot, the height of each rectangle in a column (corresponding to one cell type) represents the proportion of that cell type for that sample; we can see that for each cell type, about half of its height is in each sample, with some variation, with the most uneven distributions having about 1/3 in the CTRL sample and 2/3 in the STIM sample.

The mosaic plot also shows us the relative amount of each cell type in the samples, represented by the width of the columns, with wider columns having more cells. The most plentiful cell type, “CD14 Mono”, has over 2000 cells per sample, while the least plentiful, “Eryth”, has only about 20 to 30 cells per sample.

3. Cluster markers

Identify the genes most highly expressed in each cluster. Set logfc.threshold to 0.25 and only.pos should be true. Explain why only.pos is set to true.

Use the code from the pbmc tutorial to plot the top 5 genes per cluster in a heatmap. What does this heatmap tell us? I know the cluster labels and gene names will be squished and that’s ok!

DefaultAssay(immune.combined) <- "RNA"

Idents(immune.combined) <- "seurat_annotations"
immune.combined.markers <- FindAllMarkers(immune.combined, logfc.threshold=0.25, verbose=FALSE, only.pos=TRUE)

#head(immune.combined.markers)

We set “only.pos=TRUE” because we only want to examine differentially expressed genes with positive fold change; that is, genes more highly expressed in the cluster in question compared to the other clusters. More highly expressed or upregulated genes are ones which are more “active” in the process in question; because their gene products are transcribed more highly, it implies that gene is actively involved in the condition being studied. This is more informative for us, and therefore we are more interested in these positive fold changes and want to examine them alone.

# Identify top 5 genes in each cluster
immune.combined.markers %>%
    group_by(cluster) %>%
    slice_head(n = 5) %>%
    ungroup() -> top5
DoHeatmap(immune.combined, features = top5$gene) + NoLegend()

The heatmap tells us the relative expression across all clusters for the genes we have identified as the top 5 genes within each cluster, with higher expression represented by brighter colors and lower expression by darker colors. We can see that for most of the clusters, the brightest band for that cluster corresponds to the genes identified as the top 5 most highly expressed for that cluster; and for most of the clusters, this band is unique to that cluster, implying that those top 5 highly expressed genes are not similarly highly expressed in the other clusters. Thus we can consider those genes to be defining for that cluster. Clusters which appear to be exceptions are: CD4 Memory T, as its band of top 5 genes appears to have similar expression levels in another cluster (CD4 Native T); and CD8 T and NK, which both show a similar band of expression for the top 5 genes identified in CD8 T (with NK also showing a band for its own top 5 genes, thus displaying two distinct bands of similar intensity).

4. Violin and Featureplot

Pick 4 cluster defining genes from your results in question #2 and plot a shaded featurePlot for each. Plot a violin plots of the same genes. First group by cell type and the split by sample and group by cell type. Check out the metadata to check what the cell type and sample names are called.

What do these plot tell you? Is it what you expected? Does there seem to be a sample difference in the expression of these genes?

# Plot feature plots

defining_genes<-c( "CTSB", "VMO1", "CD79A", "PRF1")

FeaturePlot(immune.combined, features = defining_genes, max.cutoff = 3, cols = c("grey", "red"), reduction='umap')

# Plot violin plots
VlnPlot(immune.combined, features = defining_genes, group.by = "seurat_annotations", pt.size = 0, combine=FALSE)

VlnPlot(immune.combined, features = defining_genes, split.by = "stim", group.by = "seurat_annotations", pt.size = 0, combine=FALSE)

From our FeaturePlot, we can see that each of the defining genes clusters fairly cleanly, and is localized to the same cluster region as its associated cell type (see earlier DimPlot results in Question #1). The genes with the most bleed-over into other cell-type cluster regions are CTSB (picked as the defining gene for CD14 Mono) and VMO1 (CD16 Mono). This is not surprising, as an examination of the heatmap will show that the defining genes for these cell types are less distinguished/unique.

From the violin plots, we see that the distribution of each gene supports choosing that gene as defining for the cell type, since we can see that the gene’s average expression is highest for that cell type, and that more cells of that type express highly than cells of other types. As with our FeaturePlot, we can see that CTSB has the most representation among other cell types, although its expression is still highest for CD14 Mono. Again, this is not unexpected, as we could see this quality from the heatmap as well; however, the violin plots make it easier to see.

From the split violin plots, we see that for most of these 4 cell types, there is not much expression difference between the STIM and CTRL samples. For example, looking at VMO1, we see that its distribution within the CD16 Mono cells is fairly similar in most sample groups. Only PRF1 appears to show a significant difference between the samples, with the gene expressing more highly in the STIM group. There are a few other noticeable differences between samples, but they occur in cell types other than the primary one for that defining gene; for example, CTSB (CD14 Mono) shows a difference between samples within the Mk cell type, with CTRL samples showing expression but STIM showing none. However, wherever this occurs, the distribution is generally thin and not very highly expressed.

5. Differential expression

What genes are differentially expressed between STIM and CTRL in CD4 Naive T?
What genes are differentially expressed between STIM and CTRL in CD4 Memory T? Hint: set the Ident of the seurat object to celltype.stim and check out the group names. table function would help.
Filter each gene list for genes with adj p<0.05

# Create new group names for each cell type + sample type
immune.combined$celltype.stim <- paste(immune.combined$seurat_annotations, immune.combined$stim, sep = "_")
Idents(immune.combined) <- "celltype.stim"

# Find diff expressed genes for CD4 Naive T, and filter results by p_val_adj
CD4_Naive_T.interferon.response <- FindMarkers(immune.combined, ident.1 = "CD4 Naive T_STIM", ident.2 = "CD4 Naive T_CTRL", verbose = FALSE)
CD4_Naive_T.interferon.response<-subset(CD4_Naive_T.interferon.response, CD4_Naive_T.interferon.response$p_val_adj<0.05)

# Find diff expressed genes for CD4 Memory T, and filter results by p_val_adj
CD4_Memory_T.interferon.response <- FindMarkers(immune.combined, ident.1 = "CD4 Memory T_STIM", ident.2 = "CD4 Memory T_CTRL", verbose = FALSE)
CD4_Memory_T.interferon.response<-subset(CD4_Memory_T.interferon.response, CD4_Memory_T.interferon.response$p_val_adj<0.05)

Below we display the top results for genes differentially expressed between STIM and CTRL groups of CD4 Naive T cells.

head(CD4_Naive_T.interferon.response, n = 15)

##                 p_val avg_log2FC pct.1 pct.2     p_val_adj
## IFIT3    0.000000e+00  7.2124244 0.893 0.006  0.000000e+00
## IFI6     0.000000e+00  5.4637641 0.938 0.065  0.000000e+00
## ISG15    0.000000e+00  4.8046176 0.993 0.167  0.000000e+00
## ISG20    0.000000e+00  2.8760899 0.976 0.459  0.000000e+00
## IFIT1   6.396139e-307  6.6790010 0.844 0.015 8.988494e-303
## LY6E    2.776011e-289  3.7562475 0.893 0.148 3.901129e-285
## MX1     1.918617e-266  4.7097979 0.812 0.069 2.696232e-262
## B2M     3.947113e-228  0.5671462 1.000 1.000 5.546878e-224
## IFIT2   2.194799e-193  6.0286696 0.622 0.011 3.084351e-189
## OAS1    1.255347e-183  5.8122678 0.609 0.018 1.764139e-179
## IFI44L  2.667595e-176  5.1263445 0.603 0.025 3.748771e-172
## RSAD2   9.543568e-165  7.2039563 0.550 0.006 1.341158e-160
## TNFSF10 4.717738e-163  5.0409892 0.583 0.036 6.629837e-159
## MT2A    3.670505e-159  3.9699064 0.610 0.063 5.158160e-155
## IRF7    6.542332e-149  3.5151829 0.604 0.071 9.193940e-145

Below we display the top results for genes differentially expressed between STIM and CTRL groups of CD4 Memory T cells.

head(CD4_Memory_T.interferon.response, n = 15)

##                p_val avg_log2FC pct.1 pct.2     p_val_adj
## ISG15  3.956957e-292  4.8434281 0.990 0.207 5.560712e-288
## IFI6   5.259678e-288  5.8020773 0.950 0.064 7.391426e-284
## ISG20  1.999836e-249  3.0613299 0.982 0.485 2.810369e-245
## IFIT1  2.098275e-237  7.3400262 0.821 0.010 2.948705e-233
## IFIT3  1.217135e-233  6.2277021 0.819 0.016 1.710440e-229
## LY6E   4.097363e-221  4.2479642 0.858 0.102 5.758024e-217
## MX1    2.148761e-215  4.4438645 0.838 0.092 3.019654e-211
## MT2A   2.044891e-172  4.5130615 0.721 0.066 2.873686e-168
## B2M    2.042452e-162  0.6090571 1.000 1.000 2.870258e-158
## IFIT2  5.308236e-157  5.5483776 0.623 0.013 7.459664e-153
## SAT1   8.217906e-143  2.7320510 0.828 0.288 1.154862e-138
## OAS1   2.668874e-135  4.3024350 0.610 0.045 3.750569e-131
## IFI44L 2.291354e-133  4.1829031 0.607 0.047 3.220039e-129
## RSAD2  6.295011e-124  5.0416701 0.523 0.012 8.846379e-120
## PLSCR1 5.861481e-112  4.5378060 0.517 0.033 8.237139e-108

6. Functional enrichment

Using the tables made in Question 4 of genes different between STIM and CTRL in CD4 Memory T and in CD4 Naive T, do functional enrichment analysis using enrichGO from clusterProfiler (look back at our RNAseq diff expr homework). Pick a gene term type (CC, MF, BP, or ALL) and set a qvalueCutoff. Species here is human

Gene symbols are stored as rownames and it is ok to use them in this instance (since the table were made with FindMarkers and not FindAllMarkers)

Generate a functional enrichment plot of your choice and describe the results. Are the top enriched terms shared or unique?

OPTIONAL: For an extra challenge, use compareCluster to compare the two gene lists.

library(clusterProfiler)
library(enrichplot)
library(org.Hs.eg.db)
library(ggplot2)
library(stringr)

#Over-representation test
CD4_Naive_T.ego <- enrichGO(gene = rownames(CD4_Naive_T.interferon.response), 
                OrgDb = org.Hs.eg.db, ont = "MF",
                pAdjustMethod = "BH",
                qvalueCutoff  = 0.05,
                keyType="SYMBOL")

CD4_Memory_T.ego <- enrichGO(gene = rownames(CD4_Memory_T.interferon.response), 
                OrgDb = org.Hs.eg.db, ont = "MF",
                pAdjustMethod = "BH",
                qvalueCutoff  = 0.05,
                keyType="SYMBOL")

#simplify output from enrichGO by removing redundancy of enriched GO terms
CD4_Naive_T.ego.simple<-simplify(CD4_Naive_T.ego, cutoff = 0.7, by = "p.adjust", select_fun = min,
                     measure = "Wang", semData = NULL )
CD4_Memory_T.ego.simple<-simplify(CD4_Memory_T.ego, cutoff = 0.7, by = "p.adjust", select_fun = min,
                     measure = "Wang", semData = NULL )

dotplot(CD4_Naive_T.ego.simple, showCategory=10, title="CD4 Naive T")

dotplot(CD4_Memory_T.ego.simple, showCategory=10, title="CD4 Memory T")

From the dotplots of our functional enrichment analysis, we see that both CD4 Naive T cells and CD4 Memory T cells are similar in their top enriched terms; however, each also shows at least one unique term (for example, “tumor necrosis factor…” for CD4 Naive T), though these unique terms are less enriched. Both share the same top 4 enriched terms, although in slightly different order; however, the top enriched term is the same for each – “structural constituent of ribosome” – and we can tell by its color, size, and position that it is highly enriched, and is significantly more enriched than the next top terms.

Examining these top enriched terms overall, we see a strong involvement with molecular functions related to RNA/ribosomal function, and with functions related to regulation of gene expression and signal transduction. These enrichments make sense for our interferon-stimulated cells, biologically speaking, as we know that interferon stimulates immune response and these are all terms we would expect to see enriched in an immune response.